Contents

1 Introduction

Single-cell RNA-sequencing (scRNA-seq) enables the study of genome-wide transcriptional heterogeneity in cell populations that remains otherwise undetected in bulk experiments (Stegle, Teichmann, and Marioni 2015; Prakadan, Shalek, and Weitz 2017; Patange, Girvan, and Larson 2018). On the broadest level, this heterogeneity can reflect the presence of distinct cell subtypes or states. Alternatively, cell-to-cell expression heterogeneity can be due to gradual changes along processes that evolve over time, such as development and differentiation. Several clustering and pseudotime inference methods have been developed to characterise these sources of heterogeneity (Kiselev, Andrews, and Hemberg 2019; Saelens et al. 2019). However, there is a limited availability of computational tools tailored to study more subtle variability within a seemingly homogeneous cell population. This variability can be due to deterministic or stochastic events that regulate gene expression and, among others, has been reported to increase prior to cell fate decisions (Mojtahedi et al. 2016) as well as throughout ageing (Martinez-Jimenez et al. 2017).

This article complements existing scRNA-seq workflows based on the Bioconductor package ecosystem (e.g. (Lun, McCarthy, and Marioni 2016; Kim, Lee, and Kim 2019)). We describe a step-by-step analysis which uses scater and scran to perform quality control (QC) as well as initial exploratory analyses (McCarthy et al. 2017; Lun, McCarthy, and Marioni 2016). To robustly quantify transcriptional variability within and between pre-specified cell populations (such as experimental conditions or cell types) we use BASiCS (C. A. Vallejos, Marioni, and Richardson 2015a; Vallejos, Richardson, and Marioni 2016; Eling et al. 2017) — a Bayesian hierarchical framework that simultaneously performs data normalisation (global scaling), technical noise quantification and selected downstream analyses, whilst propagating statistical uncertainty across these steps. Among others, BASiCS, has led to new insights about the heterogeneity of immune cells (Martinez-Jimenez et al. 2017).

Within a population of cells, BASiCS decomposes the total observed variability in expression measurements into technical and biological components (C. A. Vallejos, Marioni, and Richardson 2015a). This enables the identification of highly variable genes (HVGs) that capture the major sources of heterogeneity within the analysed cells (Brennecke et al. 2013). HVG detection is often used as feature selection, to identify the input set of genes for subsequent analyses. BASiCS can also highlight lowly variable genes (LVGs) that exhibit stable expression across the population of cells. These may relate to essential cellular functions and can assist the development of new data normalisation or integration strategies (Lin et al. 2019).

In order to compare expression patterns between two or more pre-specified groups of cells, BASiCS provides a probabilistic decision rule to perform differential expression analyses (Vallejos, Richardson, and Marioni 2016; Eling et al. 2018). Whilst several differential expression tools have been proposed for scRNA-seq data (e.g. (Kharchenko, Silberstein, and Scadden 2014; Finak et al. 2015)), some evidence suggests that these do not generally outperform popular bulk RNA-seq tools (Soneson and Robinson 2018). Moreover, most of these methods are only designed to uncover changes in overall expression, ignoring the more complex patterns that can arise at the single cell level (Lähnemann et al. 2020). Instead, BASiCS embraces the high granularity of scRNA-seq data, uncovering changes in cell-to-cell expression variability that are not confounded by differences in technical noise or in overall expression.

In this manuscript, we briefly discuss the sources of variability that arise in scRNA-seq data and some of the strategies that have been designed to control or attenuate technical noise in these assays. We also summarise the main features of the Bioconductor packages that are used throughout this workflow and provide a description for the underlying statistical model implemented in BASiCS. This includes practical guidance to assess the convergence of the Markov Chain Monte Carlo (MCMC) algorithm that is used to infer model parameters as well as recommendations to interpret and to post-process the model outputs. Finally, we provide two step-by-step case studies using naive CD4+ T cells (Martinez-Jimenez et al. 2017) and samples collected during embryonic somitogenesis (Ibarra-Soria et al. 2018). These examples were chosen to illustate the usage of BASiCS in the context of different scRNA-seq protocols.

All source code used to generate the results presented in this article is available at [https://github.com/VallejosGroup/BASiCSWorkflow]. To ensure the reproducibility of this workflow, the analysis environment and all software dependencies are provided as a Docker (Boettiger 2015) image at: [https://hub.docker.com/repository/docker/alanocallaghan/bocker].

2 Sources of variability in scRNA-seq data

The focus of this article is to use scRNA-seq data in order to quantify the strength of cell-to-cell expression heterogeneity within seemingly homogeneous cell populations. Here, we briefly describe the underlying sources of heterogeneity that can be captured by cell-to-cell variability estimates derived from scRNA-seq data.

Stochastic variability within a cell population — often referred to as transcriptional noise — can arise from intrinsic and extrinsic sources (Elowitz et al. 2002; Eling, Morgan, and Marioni 2019). Classically, extrinsic noise is defined as stochastic fluctuations induced by different dynamic cellular states (e.g. cell cycle, metabolism, intra- and inter-cellular signalling) (Zopf et al. 2013; Iwamoto, Shindo, and Takahashi 2016; Kiviet et al. 2014). In contrast, intrinsic noise arises from stochastic effects on biochemical processes such as transcription and translation (Elowitz et al. 2002). Intrinsic noise can be modulated by genetic and epigenetic modifications (such as mutations, histone modifications, CpG island length and nucleosome positioning) (Eberwine and Kim 2015; Faure, Schmiedel, and Lehner 2017; Morgan and Marioni 2018) and is usually measured at the gene level (Elowitz et al. 2002). Cell-to-cell gene expression variability estimates derived from scRNA-seq data capture a combination of these effects, as well as deterministic regulatory mechanisms (Eling, Morgan, and Marioni 2019). Moreover, these variability estimates can also be inflated by the technical noise that is typically observed in scRNA-seq data (Brennecke et al. 2013).

Different strategies have been incorporated into scRNA-seq protocols to control or attenuate technical noise. For example, external RNA spike-in molecules (such as the set introduced by the External RNA Controls Consortium, ERCC (External RNA Controls Consortium 2005)) can be added to each cell’s lysate in a (theoretically) known fixed quantity. Spike-ins can assist quality control steps (McCarthy et al. 2017), data normalisation (Vallejos et al. 2017) and can be used to infer technical noise (Brennecke et al. 2013). Another strategy is to tag individual cDNA molecules using unique molecular identifiers (UMIs) before PCR amplification (Islam et al. 2014). Reads that contain the same UMI can be collapsed into a single molecule count, attenuating technical variability associated to cell-to-cell differences in amplification and sequencing depth (these technical biases are not fully removed unless sequencing to saturation (Vallejos et al. 2017)). However, despite the benefits associated to the use of spike-ins and UMIs, these are not available for all scRNA-seq protocols (Haque et al. 2017).

3 Methods

This step-by-step scRNA-seq workflow is primarily based on the Bioconductor package ecosystem (Amezquita et al. 2019). A graphical overview is provided in Figure 1 and its main components are described below.

Graphical overview for the scRNA-seq analysis workflow described in this manuscript. Starting from a SingleCellExperiment object, we use the scater and scran Bioconductor packages to perform quality control and initial exploratory analyses. The primary focus of this workflow is to robustly quantify transcriptional heterogeneity within seemingly homogeneous cell populations. For this purpose, we apply the BASiCS Bioconductor package, illustrating how BASiCS can be used to analyse a single or multiple pre-specified groups of cells.

Figure 1: Graphical overview for the scRNA-seq analysis workflow described in this manuscript
Starting from a SingleCellExperiment object, we use the scater and scran Bioconductor packages to perform quality control and initial exploratory analyses. The primary focus of this workflow is to robustly quantify transcriptional heterogeneity within seemingly homogeneous cell populations. For this purpose, we apply the BASiCS Bioconductor package, illustrating how BASiCS can be used to analyse a single or multiple pre-specified groups of cells.

3.1 Input data

library("SingleCellExperiment")

Here, we use the SingleCellExperiment package to convert an input matrix of raw read-counts (molecule counts for UMI-based protocols) into a SingleCellExperiment object which can store scRNA-seq data and its associated metadata, such as gene- and cell-specific information. Moreover, when available, the same object can be used to store read-counts for technical spike-in molecules (these can be accessed via the altExp() method). A major advantage of using a SingleCellExperiment object as the input for scRNA-seq analyses is that it enables interoperability across a large number of Bioconductor packages (Amezquita et al. 2019).

3.2 Quality control and exploratory analysis

library("scater")
library("scran")

An critical step in scRNA-seq analyses is to apply QC diagnostics, removing low quality samples that may distort downstream analyses. Among others, QC can help to identify samples that contain broken cells, that are empty or that contain multiple cells (Ilicic et al. 2016). Moreover, lowly expressed genes for which less reliable information is available are typically also removed. The OSCA online book each provide an extensive overview on important aspects of how to perform QC of scRNA-seq data, including standard exploratory analyses (Lun, McCarthy, and Marioni 2016; Amezquita et al. 2019).

To perform QC, we use the scater package (McCarthy et al. 2017). The addPerCellQC and addPerFeatureQC functions are applied to calculate standard QC metrics for each cell (e.g. percentage of mitochondrial reads) and gene (e.g. percentage of zeroes across all cells), respectively. The package also provides a suite of visualisation tools that can be used to explore the data under study and its associated QC diagnostic metrics.

The scran package offers additional tools for QC diagnostics and a variety of functions scRNA-seq data analysis (Lun, McCarthy, and Marioni 2016). It can perform global scaling data normalisation, calculating cell-specific scaling factors that capture global differences in read-counts across cells (e.g. due to sequencing depth and PCR amplification) (Lun, Bach, and Marioni 2016). In scran, users can also explore how the observed variability in expression counts can be decomposed into technical and biological sources (see decomposeVar). Moreover, the trendVar function can be used to infer an overall trend between gene-specific mean and variance estimates. To derive gene-specific variability estimates that are not confounded by this overall trend, the DM function calculates the distance between gene-specific squared coefficients of variation (CV\(^2\)) and a rolling median along the range of mean expression values (Kolodziejczyk et al. 2015). DM estimates enable exploratory analyses of cell-to-cell heterogeneity, but a measure of uncertainty is not readily available. As such, gene-specific downstream inference (e.g. differential variability testing) is precluded.

3.3 Analysis of cell-to-cell transcriptional variability

library("BASiCS")

The BASiCS package uses a Bayesian hierarchical framework which borrows information across all genes and cells to robustly quantify transcriptional variability (C. A. Vallejos, Marioni, and Richardson 2015b). Similar to the approach adopted in scran, BASiCS infers cell-specific global scaling normalisation parameters. However, instead of inferring these as a pre-processing step, BASiCS uses an integrated approach in which data normalisation and downstream analyses are performed simultaneously, thereby propagating statistical uncertainty. To quantify technical noise, the original implementation of BASiCS uses information from extrinsic spike-in molecules as control features, but the model has been extended to address situations in which spike-ins are not available (Eling et al. 2018).

BASiCS summarises expression patterns through gene-specific mean (\(\mu_i\)) and over-dispersion (\(\delta_i\)) parameters. Mean parameters \(\mu_i\) quantify the overall expression for each gene \(i\) across the population of cells under study. Instead, \(\delta_i\) captures the excess of variability that is observed with respect to what would be expected in a homogeneous cell population, after taking into account technical noise. This is used as a proxy to quantify transcriptional variability. Moreover, to account for the strong association that is typically observed between mean expression and over-dispersion estimates, we recently introduced gene-specific residual over-dispersion parameters \(\epsilon_i\) (Eling et al. 2018). Similar to DM values implemented in scran, these are defined as deviations with respect to an overall regression trend that captures the relationship between mean and over-dispersion values.

Parameter inference is implemented in the BASiCS_MCMC function using an adaptive Metropolis within Gibbs algorithm (Roberts and Rosenthal 2009), whose convergence can be assessed using BASiCS_EffectiveSize and BASiCS_DiagPlot. The output from BASiCS_MCMC is a BASiCS_Chain object, which can be used for further downstream analyses. In particular, BASiCS_DetectHVG and BASiCS_DetectLVG can be respectively used to identify highly and lowly variable genes within a cell population. Moreover, BASiCS_TestDE is used to perform differential mean and variability analyses between groups of cells.

3.4 Gene annotation

The molecule counts used in this workflow are annotated using Ensembl gene identifiers. In order to facilitate interpretation, it is often useful to generate a mapping from Ensembl gene IDs to gene symbols using In order to perform GO enrichment analysis, and to aid in the biological interpretation of individual genes, we need to (i) link each gene’s ID to its symbol and (ii) calculate each gene’s length. For the first task, the biomaRt Bioconductor package annotates a wide range of gene and gene product identifiers (Durinck et al. 2005) by accessing the BioMart software suite (http://www.biomart.org) using the biomaRt (Durinck et al. 2009).

For the second task, and to perform GO enrichment analysis, we further collected the gene length for each gene. This can be extracted from the EnsDb.Mmusculus.v79 annotation package. This resource offers gene annotations such as exon positions and promoters for the Ensembl database.

dir.create("rds", showWarnings = FALSE)
dir.create("downloads", showWarnings = FALSE)

if (!file.exists("rds/genenames.rds")) {
  # Initialize mart and dataset
  ensembl <- useMart(
    biomart = "ensembl",
    dataset = "mmusculus_gene_ensembl"
  )

  # Select gene ID and gene name
  genenames <- getBM(
    attributes = c("ensembl_gene_id", "external_gene_name"),
    mart = ensembl
  )

  rownames(genenames) <- genenames$ensembl_gene_id
  saveRDS(genenames, "rds/genenames.rds")
} else {
  genenames <- readRDS("rds/genenames.rds")
}
if (!file.exists("rds/genelength.rds")) {
  # Build exon list
  exons_list <- exonsBy(EnsDb.Mmusculus.v79, by = "gene")

  # Calculate summed length of all exons
  genelength <- vapply(
    exons_list,
    function(x) {
      sum(width(reduce(x)))
    },
    numeric(1)
  )
  saveRDS(genelength, "rds/genelength.rds")  
}
genelength <- readRDS("rds/genelength.rds")

# Add gene length to gene names
genenames$gene_length <- genelength[genenames$ensembl_gene_id]

4 Reproducibility

The following software versions were used throughout this workflow:

A Docker image containing all software requirements is available at [https://hub.docker.com/repository/docker/alanocallaghan/bocker].

5 C1 Fluidigm data: Analysis of naive CD4+ T cells

For the first case study, we use scRNA-seq data generated for CD4+ T cells using the C1 Single-Cell Auto Prep System (Fluidigm®). Martinez-Jimenez et al. profiled naive (hereafter also referred to as unstimulated) and activated (3 hours using in vitro antibody stimulation) CD4+ T cells from young and old animals across two mouse strains to study changes in expression variability during ageing and upon immune activation (Martinez-Jimenez et al. 2017). They extracted naive or effector memory CD4+ T cells from spleens of young or old animals, obtaining purified populations using either magnetic-activated cell sorting (MACS) or fluorescence activated cell sorting (FACS). External ERCC spike-in RNA (External RNA Controls Consortium 2005) was added to aid the quantification of technical variability across all cells and all experiments were performed in replicates (also referred to as batches) to control for batch effects.

5.1 Obtaining the data

The matrix with raw read counts can be obtained from ArrayExpress under the accession number E-MTAB-4888. In the matrix, column names contain library identifiers and row names display gene Ensembl identifiers.

if (!file.exists("downloads/raw_data.txt")) {
  # Download raw counts file
  website <- "https://www.ebi.ac.uk/"
  folder <- "arrayexpress/files/E-MTAB-4888/"
  file <- "E-MTAB-4888.processed.1.zip"
  destfile <- "downloads/raw_data.txt.zip"
  download.file(
    paste0(website, folder, file),
    destfile = destfile
  )
  unzip("downloads/raw_data.txt.zip", exdir = "downloads")
  file.remove("downloads/raw_data.txt.zip")
}

# Read in raw data
CD4_raw <- read.table("downloads/raw_data.txt", header = TRUE, sep = "\t")
CD4_raw <- as.matrix(CD4_raw)

The input matrix contains data for1513 cells and 31181 genes (including sum(grepl("ERCC", rownames(CD4_raw))) ERCC spike-ins). Information about experimental conditions and batches is available in a metadata file under the same accession number.

if (!file.exists("downloads/metadata_file.txt")) {
  # Download raw counts file
  website <- "https://www.ebi.ac.uk/"
  folder <- "arrayexpress/files/E-MTAB-4888/"
  file <- "E-MTAB-4888.additional.1.zip"
  destfile <- "downloads/metadata.txt.zip"
  download.file(
    paste0(website, folder, file),
    destfile = destfile
  )
  unzip("downloads/metadata.txt.zip", exdir = "downloads")
  file.remove("downloads/metadata.txt.zip")
}
# Read in metadata file
CD4_metadata <- read.table("downloads/metadata_file.txt", 
                           header = TRUE, sep = "\t")

# Save library identifier as rownames
rownames(CD4_metadata) <- CD4_metadata$X

# Show metadata entries
names(CD4_metadata)
## [1] "X"           "Strain"      "Age"         "Stimulus"    "Individuals"
## [6] "Celltype"

The metadata contains library identifiers (X), strain information (Strain; Mus musculus castaneus or Mus musculus domesticus), the age of the animals (Age; young or old), stimulation state of the cells (Stimulus; naive or activated), batch information (Individuals; associated to different mice), and cell type information (Celltype; via FACS or MACS purification).

The data and metadata described above is then converted into a SingleCellExperiment object.

# Separate intrinsic from ERCC counts
bio_counts <- CD4_raw[!grepl("ERCC", rownames(CD4_raw)), ]
spike_counts <- CD4_raw[grepl("ERCC", rownames(CD4_raw)), ]
# Generate the SingleCellExperiment object
sce_CD4_all <- SingleCellExperiment(
  assays = list(counts = as.matrix(bio_counts)),
  colData = CD4_metadata[colnames(CD4_raw), ]
)
# Add read-counts for spike-ins 
altExp(sce_CD4_all, "spike-ins") <- SummarizedExperiment(
  assays = list(counts = spike_counts)
)
sce_CD4_all
## class: SingleCellExperiment 
## dim: 31089 1513 
## metadata(0):
## assays(1): counts
## rownames(31089): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ENSMUSG00000106668 ENSMUSG00000106670
## rowData names(0):
## colnames(1513): do4737 do4739 ... do12254 do12257
## colData names(6): X Strain ... Individuals Celltype
## reducedDimNames(0):
## altExpNames(1): spike-ins

Throughout our analysis, we focus on naive and activated CD4+ T cells obtained from young Mus musculus domesticus (hereafter also referred to as black 6; B6) animals, purified using MACS-based cell sorting. The following code is used to extract these samples from the full dataset.

ind_select <- sce_CD4_all$Strain == "Mus musculus domesticus" &
  sce_CD4_all$Age == "Young" &
  sce_CD4_all$Celltype == "MACS-purified Naive"
sce_naive_active <- sce_CD4_all[, ind_select]
sce_naive_active
## class: SingleCellExperiment 
## dim: 31089 146 
## metadata(0):
## assays(1): counts
## rownames(31089): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ENSMUSG00000106668 ENSMUSG00000106670
## rowData names(0):
## colnames(146): do6113 do6118 ... do6493 do6495
## colData names(6): X Strain ... Individuals Celltype
## reducedDimNames(0):
## altExpNames(1): spike-ins

5.2 QC and exploratory analysis

5.2.0.1 Cell-level QC.

The data available at E-MTAB-4888 has been already filtered to remove poor quality samples. The QC applied in (Martinez-Jimenez et al. 2017) removed cells with: (i) fewer than 1,000,000 total reads, (ii) less than 20% of reads mapped to endogenous genes, (iii) less than 1,250 or more than 3,000 detected genes and (iv) more than 10% or less than 0.5% reads mapping to mitochondrial genes. Here, we visualise some of these metrics.

# Calculate per cell QC metrics
sce_naive_active <- addPerCellQC(sce_naive_active, use_altexps = TRUE)
plotColData(sce_naive_active, x = "sum", y = "detected") +
  xlab("Number of mapped exonic reads") +
  ylab("Number of detected genes")

min_reads <- 600000
max_reads <- 2300000
plotColData(sce_naive_active, x = "sum", y = "altexps_spike-ins_sum") +
  xlab("Total number of reads (excludes non-mapped and intronic reads)") +
  ylab("Total number of spike-in reads")

Another widely used QC diagnostic plot is to compare the total number (or fraction) of spike-in counts versus the total number (or fraction) of endogeneous counts (Figure ??). In such a plot, low quality samples are characterised by a high fraction of spike-in counts and a low fraction of endogeneous counts.

These QC metrics can also be visualised with respect to cell-level metadata. For example, the following figures show how these metrics relate to the experimental conditions (active vs unstimulated) and to the different mice from which cells were collected.

p_stimulus <- plotColData(sce_naive_active, x = "sum", y = "detected", 
                          colour_by = "Stimulus") +
  xlab("Number of mapped exonic reads") +
  ylab("Number of detected genes") +
  theme(
    legend.position = "bottom",
    axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Experimental conditions")

p_batch <- plotColData(sce_naive_active, x = "sum", y = "detected", 
                       colour_by = "Individuals") +
  xlab("Number of mapped exonic reads") +
  ylab("Number of detected genes") +
  theme(
    legend.position = "bottom",
    axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Source animals (batches)")

multiplot(p_stimulus, p_batch, cols = 2)

# Normalisation
sce_naive_active <- computeSumFactors(sce_naive_active)
sce_naive_active <- logNormCounts(sce_naive_active)
# Calculate PCA
sce_naive_active <- runPCA(sce_naive_active)

The Biocpkg("SimpleSingleCell") Bioconductor workflow and the OSCA online book each provide an extensive overview on important aspects of how to perform low-level analysis of scRNA-seq data, including quality control (Lun, McCarthy, and Marioni 2016; Amezquita et al. 2019). The calculateQCMetrics function of the scater package can be used to calculate a number of cell- and gene-specific quality metrics. Furthermore, scater offers the runPCA function to perform principal component analysis (PCA) across all cells.

# Calculate quality metrics
sce_naive_active <- addPerCellQC(sce_naive_active, use_altexps = TRUE)
sce_naive_active <- addPerFeatureQC(sce_naive_active)

clean_colnames <- make.names(colnames(colData(sce_naive_active)))
colnames(colData(sce_naive_active)) <- clean_colnames

The activation status and batch information for each individual cell of the selected dataset can be seen in Figure 2.

# Visualise the conditions and batch structure
p_stimulus <- scater::plotPCA(sce_naive_active, colour_by = "Stimulus")
p_batch <- scater::plotPCA(sce_naive_active, colour_by = "Individuals")
multiplot(p_stimulus, p_batch, cols = 2)
Visualisation of the cell stimulus (left) and the batch information per cell (right). Cells from two animals were captured either in the naive or activated state.

Figure 2: Visualisation of the cell stimulus (left) and the batch information per cell (right)
Cells from two animals were captured either in the naive or activated state.

The cells split into two conditions: naive and activated CD4+ T cells. Furthermore, we detect neither obvious batch effects nor outlier cells.

In the next step, we will visualise selected cell-specific quality metrics overlayed on the principal components. Figure @ref(fig:no-genes_total-counts) highlights the number of endogeneous genes detected per cell and the total number of counts. Both of these metrics can be used to detect empty wells or broken cells (low values) and possible doublets (high values).

# Visualise number of endogeneous genes detected
p_total_features <- scater::plotPCA(
    sce_naive_active, 
    colour_by = "detected"
  ) +
  scale_fill_viridis_c(name = "Number of genes", trans = "log10")

# Visualise log10-transformed total number of counts
p_total_counts <- scater::plotPCA(
    sce_naive_active, 
    colour_by = "sum"
  ) +
  scale_fill_viridis_c(name = "Number of counts", trans = "log10")
multiplot(p_total_features, p_total_counts, cols = 2)
Visualisation of the number of biological genes detected per cell (left) and the total number of reads per cell (right).

Figure 3: Visualisation of the number of biological genes detected per cell (left) and the total number of reads per cell (right)

We detect a higher number of expressed genes and a higher total read count in activated cells. Furthermore, cells within each group show a wide distribution of these quality measures, without clear outlying cells.

5.2.0.2 Gene-level QC.

Following cell-specific quality control, we will remove all genes that are not detected in at least 6 cells across both conditions. Furthermore, we remove genes that are not expressed with an average count of 1 across all cells. These thresholds need to be set specifically for each dataset, and careful gene-specific quality metrics need to be closely examined as suggested by the SimpleSingleCell and OSCA Bioconductor workflows (Lun, McCarthy, and Marioni 2016, @Amezquita2019).

# Remove genes
ind_expressed <- rowSums(counts(sce_naive_active) > 0) >= 5 &
  rowMeans(counts(sce_naive_active)) >= 1
sce_naive_active <- sce_naive_active[ind_expressed, ]

5.3 Calculating the molecule count for spike-in genes

BASiCS requires the absolute molecule count of the spike-in transcripts that were added to each well. To calculate the molecule count, we require the dilution and the concentration of the spike-in mix. For this, a table of the spike-in concentrations can be downloaded from https://www.thermofisher.com. The file contains the concentrations of 2 ERCC spike-in mixes.

For the CD4+ T cell data, the authors added a 1:50,000 dilution of the ERCC spike-in mix 1 (Martinez-Jimenez et al. 2017). We will first download the concentration information.

# Read in the spike-in concentrations
download_file <- function(file, website, folder, destfile = file) {
  if (!file.exists(destfile)) {
    download.file(paste0(website, folder, file), destfile)
  }
}
website <- "https://assets.thermofisher.com/"
folder <- "TFS-Assets/LSG/manuals/"
file <- "cms_095046.txt"
download_file(file, website, folder, "downloads/spike_info.txt")

ERCC_conc <- read.table(
  "downloads/spike_info.txt",
  sep = "\t", header = TRUE
)

The concentration is given in units of \(aM\mu{}l^{-1}\). We will first calculate the concentration in \(M\mu{}l^{-1}\).

# Moles per micro litre
ERCC_mmul <- ERCC_conc$concentration.in.Mix.1..attomoles.ul. * (10^(-18))

From this, we can calculate the molecule count per \(\mu{}l\) using the fact that 1 mole is comprised of \(6.02214076 \times 10^{23}\) molecules.

# Molecule count per micro litre
ERCC_countmul <- ERCC_mmul * (6.02214076 * (10^23))

During the preparation of the reaction mix, the authors diluted this mix by a factor of 1:50,000 (Martinez-Jimenez et al. 2017). The actual molecule number of spike-ins can therefore be calculated per \(\mu{}l\):

ERCC_count <- ERCC_countmul / 50000

The volume per well in the IFC chip is \(9nl\) https://www.fluidigm.com/faq/ifc-9. We therfore need to calculate the number of molecules in \(9nl\) reaction mix.

ERCC_count_final <- ERCC_count * 0.009

Depending on the technology used to capture and process single cells, as well as the dilution of the spike-in mix, the absolute number of spike-ins per reaction can change. The absolute amount of spike-ins per reaction is only used to calibrate the scale of the capture efficiency normalisation parameter \(\nu_j\). Since the true concentration of each spike-in molecule is known, BASiCS uses the observed counts of each molecule for each cell to estimate the capture efficiency for each cell, \(\nu_j\). Differences in the global scale across all wells of the level of spike-ins molecules thus do not affect the cell-to-cell ratios of \(\nu_j\). BASiCS assumes that the quantity of spike-ins is consistent in each well in each experiment. While the quantity used must remain constant between wells and experiments, the scale does not affect the results, provided the spike-ins are at a reasonable level. In particular, they should not be in such a low concentration that they are rarely detected, and they should not be at such a high concentration that the majority of the sequencing reads map to the spike-in molecules.

We can now use the molecule count to prepare the BASiCS data object. To incorporate the spike-in molecule counts within the SingleCellExperiment object that BASiCS requires, the first column must contain the names associated to the spike-in genes. The second column must contain the input number of molecules for the spike-in genes (amount per reaction).

# Prepare the data.frame
ERCC_count <- data.frame(
  row.names = ERCC_conc$ERCC.ID,
  Names = ERCC_conc$ERCC.ID,
  count = ERCC_count_final
)

5.4 Single condition example: Naive CD4+ T cells

To highlight the use of BASiCS to analyse cells in the single-condition case, we select naive CD4+ T cells of young B6 animals. Here, BASiCS can be used to identify highly- and lowly-variable genes and calculate interpretable gene- and cell-specific parameters. Among others, these include gene-specific parameters for mean expression, over-dispersion (biological variability) and residual over-dispersion (biological variability after correcting for the mean expression effect).

5.4.1 Preparing the BASiCS data object

Many BASiCS functions use as input objects of the class SingleCellExperiment. The newBASiCS_Data function can be used to create the required SingleCellExperiment object based on the following information:

  • Counts: a matrix of raw expression counts with dimensions \(q\) times \(n\), where \(q\) is the number of genes (technical + biological) and \(n\) is the number of cells. Gene names must be stored as row names of the counts matrix.

  • Tech: a vector of TRUE/FALSE elements with length \(q\). If Tech[i] = FALSE the gene i is biological; otherwise the gene is spike-in. This vector must be specified in the same order of genes as in the counts matrix.

  • SpikeInfo: a data.frame with \(q-q_0\) rows where \(q_0\) is the number of biological genes. The first column must contain the names associated to the spike-in genes. The second column must contain the input number of molecules for the spike-in genes (amount per cell).

  • BatchInfo (optional argument): vector of length \(n\) to indicate batch structure in situations where cells have been processed using multiple batches.

We will first select the naive cells from the SingleCellExperiment object that was generated above.

ind_stimulated <- sce_naive_active$Stimulus == "Unstimulated"
sce_naive <- sce_naive_active[, ind_stimulated]
ind_expressed <- rowSums(counts(sce_naive) > 0) > 1 & 
  rowMeans(counts(sce_naive)) >= 1
sce_naive <- sce_naive[ind_expressed, ]

Next, we will use the newBASiCS_Data function to re-generate the SingleCellExperiment object for the use with BASiCS.

# Here is the first time that we use BASiCS

# Select the ERCC spike-ins of the dataset
counts <- counts(sce_naive)
spikes <- assay(altExp(sce_naive, "spike-ins"))
spikes_present <- rowSums(spikes) != 0
## Remove spike-ins that are not present from matrix and SCE object
spikes <- spikes[spikes_present, ]
altexp_present <- altExp(sce_naive, "spike-ins")[spikes_present, ]
altExp(sce_naive, "spike-ins") <- altexp_present
ind_spike <- c(rep(FALSE, nrow(counts)), rep(TRUE, nrow(spikes)))
spike_input <- ERCC_count[rownames(spikes), ]

# Generate the SingleCellExperiment object
data_naive <- newBASiCS_Data(
  Counts = rbind(counts, spikes),
  Tech = ind_spike,
  SpikeInfo = spike_input,
  BatchInfo = sce_naive$Individuals
)
data_naive
## class: SingleCellExperiment 
## dim: 8427 93 
## metadata(1): SpikeInput
## assays(1): counts
## rownames(8427): ENSMUSG00000000001 ENSMUSG00000000056 ...
##   ENSMUSG00000106607 ENSMUSG00000106620
## rowData names(0):
## colnames(93): do6113 do6118 ... do6398 do6399
## colData names(1): BatchInfo
## reducedDimNames(0):
## altExpNames(1): spike-ins

Alternatively, when using datasets that contain spike-in genes, the original SingleCellExperiment object can be extended by simply specifying a BatchInfo slot in the colData object and by adding the SpikeInfo object to the metadata slot.

data_naive <- sce_naive
colData(data_naive)$BatchInfo <- colData(sce_naive)$Individuals
metadata(data_naive)$SpikeInput <- spike_input

After creating the SingleCellExperiment object that contains all information that BASiCS requires, the MCMC sampler can be run to generate posterior estimates of model parameters.

5.4.2 Running MCMC

It is recommended to run the BASiCS_MCMC sampler for at least 40,000 iterations to ensure convergence. However, if datasets are large and each condition contains hundreds of cells from a homogeneous population, BASiCS can be run with fewer (e.g. 20,000) iterations (see 10X Genomics data). For convenience, BASiCS_MCMC can be run with very few (e.g 1,000) iterations to test whether the sampler breaks.

Here, we run the MCMC sampler for 40,000 iterations, discaring the initial 20,000 samples as burn-in, and using thinning factor of 20, such that only each 20^th sample is stored for later use. Since the dataset contains spike-in genes, we run the sampler with WithSpikes = TRUE and we also want to correct for the mean-variance trend using Regression = TRUE. For further information see Methods.

MCMC_naive <- BASiCS_MCMC(
  Data = data_naive,
  PrintProgress = FALSE,
  N = 40000,
  Thin = 20,
  Burn = 20000,
  Regression = TRUE,
  WithSpikes = TRUE
)

This sampler runs for 167 minutes on a 1.4 GHz Intel Core i5 procesor with 4GB RAM and produces a BASiCS_Chain data object. For comparison, this sampler runs for 97 minutes on a 3.4 GHz Intel Core i7 procesor with 16GB RAM. For convenience, the MCMC chain can be obtained online at https://git.ecdf.ed.ac.uk/vallejosgroup/basicsworkflow2020.

download_file(
  file = "MCMC_naive.rds",
  website = "https://git.ecdf.ed.ac.uk/vallejosgroup/",
  folder = "basicsworkflow2020/raw/master/MCMCs/",
  destfile = "rds/MCMC_naive.rds"
)

MCMC_naive <- readRDS("rds/MCMC_naive.rds")

The BASiCS_Chain object contains a list of matrices that store the individual MCMC samples per parameter. Each matrix contains the cell- or gene-specific parameters in the columns and the MCMC samples in the rows. BASiCS provides the displayChainBASiCS function to access the cell- or gene-specific parameters. As an example, we access the first 2 samples for \(\mu_i\) of the first 5 genes.

displayChainBASiCS(MCMC_naive, Param = "mu")[1:2, 1:5]
##      ENSMUSG00000000001 ENSMUSG00000000056 ENSMUSG00000000085
## [1,]           9.291713           1.823314           1.237160
## [2,]          15.078529           2.126944           1.494165
##      ENSMUSG00000000088 ENSMUSG00000000131
## [1,]           1.479122           24.36733
## [2,]           1.803549           30.40570

Given that we ran the MCMC samples for 40,000 iterations, discarding the first 20,000 samples as burn-in, and using a thining factor of 20, we obtained 1000 draws for each parameter, stored in the BASiCS_Chain object:

dim(displayChainBASiCS(MCMC_naive, Param = "mu"))
## [1] 1000 8427

5.4.3 Assessing MCMC convergence

In the next step, we need to assess the convergence of the chain to ensure robust downstream analysis. Specifically, we want to ensure that the chain converged to its stationary distribution before the burn-in period ended, and after burn-in that it sampled efficiently from this stationary distribution. There are multiple ways to visualise and assess the convergence of MCMC chains (Cowles and Carlin 1996; Brooks and Gelman 1998). The coda contains diagnostic and plot functions for this task ((Plummer et al. 2006)).

Here, we highlight two ways of assessing the convergence of the MCMC sampler by (i) plotting trace plots, sample densities and autocorrelation, and (ii) plotting the the effective sample size across multiple parameters. Trace plots show the sampled parameter values over time. A chain is likely to have converged when the sample density (in form of a histogram) shows a unimodal distribution. The autocorrelation of an MCMC chain is defined as the Pearson correlation between the chain and time-delayed versions of the chain. The difference in time-points is referred to as ‘lag’. This helps to assess whether a chain has sampled efficiently from its stationary distribution. High autocorrelation indicates that the obtained samples are not independent, and indicates that we may not have enough information about the posterior distribution of the parameters. The chain is likely to be sampling efficiently if the autocorrelation (except for lag = 1) is small (e.g. < 0.25), as it indicates that the stored samples are largely independent.

Effective sample size is a measure of the number of independent samples generated for a model parameter (Gelman et al. 2014). Simply, it is defined as the number of samples taken relative to the total autocorrelation. More formally, it is defined as follows: \[ \mbox{ESS} = \frac{n}{1 + 2\sum_{k=1}^\infty \rho(k)} \] where \(n\) is the number of samples and \(\rho(k)\) is the autocorrelation at lag \(k\). We can visualise this parameter by plotting histograms of the effective sample size over all genes, and by plotting effective sample size against mean expression or over-dispersion for all genes.

# Convergence of mean expression parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_naive, Param = "mu", Gene = 1)

plot(MCMC_naive, Param = "mu", Gene = 7216)

# Effective sample size
BASiCS_DiagPlot(MCMC_naive, Param = "mu")

# Histogram of sample size for mu, as some genes seem to have low values for mu.
BASiCS_DiagHist(MCMC_naive, Param = "mu")

# Convergence of over-dispersion parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_naive, Param = "delta", Gene = 100)

plot(MCMC_naive, Param = "delta", Gene = 5000)

# Effective sample size
BASiCS_DiagPlot(MCMC_naive, Param = "delta")

# Convergence of residual over-dispersion parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_naive, Param = "epsilon", Gene = 200)

plot(MCMC_naive, Param = "epsilon", Gene = 8000)

# Effective sample size
BASiCS_DiagPlot(MCMC_naive, Param = "epsilon")

# Convergence of mRNA capture efficiency parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_naive, Param = "s", Cell = 10)

plot(MCMC_naive, Param = "s", Cell = 50)

# Effective sample size
BASiCS_DiagPlot(MCMC_naive, Param = "s")

We can see several genes show very low sample size for mu. We can identify these using BASiCS_EffectiveSize.

ess <- BASiCS_EffectiveSize(MCMC_naive, Param = "mu")
low_ess_genes <- names(ess[ess < 100])
hist(
  ess,
  breaks = "FD",
  main = "Effective sample size",
  xlab = "Effective sample size"
)
abline(v = 100, lty = "dashed", col = "grey60")

head(low_ess_genes)
## [1] "ENSMUSG00000000148" "ENSMUSG00000001054" "ENSMUSG00000001524"
## [4] "ENSMUSG00000001751" "ENSMUSG00000002103" "ENSMUSG00000002280"

Given that the sample has generated less than 100 independent samples for these genes, there may not be sufficient posterior information to reliably calculate differences in mean expression, and it would be appropriate to exclude these genes from downstream analysis. As we will see later, BASiCS_TestDE automatically excludes these genes from tests of differential expression.

5.5 Downstream analysis

In this section, we will highlight the use of BASiCS when analysing cells of a single condition. This includes normalisation, variance decomposition, detection of highly and lowly variable genes, and the use of gene-specific parameters as interpretable variability measures. Furthermore, we will compare the results in the individual steps with results obtained using the Biocpkg("scran").

5.5.1 Normalisation

Posterior estimates of cell-specific parameters can be used to normalise the data and correct for biases in mRNA content (Vallejos et al. 2017). To perform normalisation, BASiCS provides the BASiCS_DenoisedCounts and the BASiCS_DenoisedRates functions. These functions produce normalised expression values, with technical variation removed. They differ in that the former returns a denoise expression count, such that \[ x^*_{ij} = \frac{ x_{ij} } {\hat{\phi}_j \hat{\nu}_j}, \] represents a normalised and denoised expression count. The latter returns a rate, such that \[ \Lambda_{ij} = \hat{\mu_i} \hat{\rho}_{ij} \] is the denoised expression rate for a cell and gene. Both of these functions take the SingleCellExperiment and BASiCS_Chain objects as inputs.

counts_denoised <- BASiCS_DenoisedCounts(Data = data_naive, Chain = MCMC_naive)

These normalised counts or rates can be further used for dimensionality reduction and clustering as explained elsewhere (Lun, McCarthy, and Marioni 2016; Amezquita et al. 2019). Alternatively, a dimensionality reduction scheme that operates directly on the count matrix could be used (Townes et al. 2019; Lopez et al. 2018).

5.5.2 Lowly and highly variable gene detection

BASiCS offers a function to select genes with large or small biological variance. If the MCMC sampler was run with Regression = TRUE (default), the BASiCS_DetectHVG and BASiCS_DetectLVG functions take the BASiCS_Chain object and a quantile threshold. Using the residual over-disperison parameters, genes are ranked by their variability and the BASiCS_DetectHVG function selects (for example) the 10% most highly variable genes (PercentileThreshold = 0.9). Similarly, when detecting lowly variable genes, the BASiCS_DetectLVG selects the 10% most lowly variable genes (PercentileThreshold = 0.1). The propability threshold for a gene showing higher variability than the percentile threshold is found by controling the EFDR to 10% (default).

# Highly variable genes
HVG <- BASiCS_DetectHVG(MCMC_naive, PercentileThreshold = 0.9)

# Lowly variable genes
LVG <- BASiCS_DetectLVG(MCMC_naive, PercentileThreshold = 0.1)
HVG_table <- HVG@Table
LVG_table <- LVG@Table

This analysis results in the detection of 168 highly variable genes and 709 lowly variable genes.

The Biocpkg("scran") provides similar functions to detect HVGs and we can compare the results of both methods. scran first fits a smooth regression between the variance of the log-transformed expression of the spike-in transcripts and their mean abundance using the trendVar function. Afterwards, the decomposeVar function decomposes the gene-specific variance into a biological and technical component.

# Variance decomposition
# var_out <- scran::modelGeneVar(sce_naive)
var_out <- scran::modelGeneCV2(sce_naive)
var_out <- var_out[HVG_table$GeneName, ]

As proposed by SimpleSingleCell and by OSCA, HVGs can be defined as those genes displaying a biolgical variance component of larger than 0.5, while controlling the FDR to 5% (Lun, McCarthy, and Marioni 2016; Amezquita et al. 2019). We can then compare the the overlap of the HVG identified by scran and by BASiCS.

is_hvg <- var_out$ratio > 2 & var_out$FDR < 0.05
# is_hvg <- var_out$FDR < 0.1
ind_hvg <- which(is_hvg)
hvg_out <- var_out[ind_hvg, ]
hvg_out <- hvg_out[order(hvg_out$FDR), ]
nrow(hvg_out)
## [1] 540
# Intersection between BASiCS and scran HVG
length(intersect(rownames(hvg_out), HVG_table$GeneName[HVG_table$HVG]))
## [1] 62
# Compare BASiCS and scran results
plot_data <- data.frame(
  Mu = HVG_table$Mu,
  Epsilon = HVG_table$Epsilon,
  # HVG = HVG_res
  BASiCS = ifelse(HVG_table$HVG, "HVG", "Not HVG"),
  scran = ifelse(is_hvg, "HVG", "Not HVG")
)
plot_data <- plot_data[order(plot_data$scran, plot_data$BASiCS, decreasing=TRUE), ]
plot_data$Status <- paste0(
  "BASiCS: ", plot_data$BASiCS, "\n",
  "scran: ", plot_data$scran, "\n")
ggplot(plot_data) +
  geom_point(
    aes(log(Mu), Epsilon, colour = Status),
    alpha = 0.6
  )

62 of the 540 HVGs of scran are among the 168 HVGs identified by BASiCS.

5.5.3 Mean-variance trend

TODO: Move this up

Numerous studies have highlighted the relationship between gene-specific variability measures (e.g., squared coefficient of variation) and mean abundance (Ritchie et al. 2015; Grün, Kester, and Oudenaarden 2014; Love, Huber, and Anders 2014; McCarthy, Chen, and Smyth 2012). BASiCS provides the BASiCS_ShowFit function that plots the gene-specific over-dispersion parameters (delta) versus mean expression parameters (mu).

BASiCS_ShowFit(MCMC_naive)

Here, we observe that the over-dispersion estimates are inversely correlated with mean expression. However, by performing a regression between over-dispersion and mean expression, we can correct for this trend and obtain variability measures that show no correlation with mean expression (Eling et al. 2018). The purple points in the plot indicate genes that are not expressed in at least 2 cells. BASiCS automatically excludes these genes due to challenges in interpreting variability estimates and changes in variability for genes that are only expressed in one cell.

We can now compare these gene-specific variability measures (over-dispersion and residual over-dispersion) to previously used measures to quantify cell-to-cell expression variability (Brennecke et al. 2013, @Kolodziejczyk2015cell).

5.5.4 Comparison to variance, Fano factor, CV^2, and DM

Widely used measures of expression variability include the variance (Shalek et al. 2014), the Fano factor (variance divided by mean expression) (Wills et al. 2013; Grün, Kester, and Oudenaarden 2014; Arriaga 2009) and the coefficient of variation (CV, variance divided by squared mean expression) (Buettner et al. 2015; Brennecke et al. 2013). Here, we will highlight the mean-variance realtionship for each variability measures. For this analysis, we exclude genes that are not expressed in at least 2 cells. These genes can be identified in the displayChainBASiCS(MCMC_naive, Param = "epsilon") to contain NA. We will use the ggpointdensity to visualise the density of the genes along the axes of mean and variability.

library("ggpointdensity")
library("viridis")
# Exclude ERCCs
counts_denoised <- counts_denoised[!grepl("ERCC", rownames(counts_denoised)), ]

# Exclude genes
eps_not_na <- !is.na(displayChainBASiCS(MCMC_naive, Param = "epsilon")[1, rownames(counts_denoised)])
counts_denoised <- counts_denoised[eps_not_na, ]

# Variance
var_genes <- apply(counts_denoised, 1, var)
mean_genes <- apply(counts_denoised, 1, mean)

ggplot(
    data.frame(
      variance = var_genes,
      mean = mean_genes
    )
  ) +
  geom_pointdensity(aes(log(mean), log(variance))) +
  scale_color_viridis()

# Fano factor
ggplot(
    data.frame(
      fano = var_genes / mean_genes,
      mean = mean_genes
    )
  ) +
  geom_pointdensity(aes(log(mean), log(fano))) +
  scale_color_viridis()

# Squared coefficient of variation
ggplot(
    data.frame(
      CV2 = var_genes / mean_genes^2,
      mean = mean_genes
    )
  ) +
  geom_pointdensity(aes(log(mean), log(CV2))) +
  scale_color_viridis()

We see that all measures correlate with mean expression. The same is true for the over-dispersion parameters estimated by BASiCS, as shown below. Again, for this comparison, we exclude genes that are not observed to be expressed in at least 2 cells

summary_naive <- Summary(MCMC_naive)
# Remove genes
eps_not_na <- !is.na(displaySummaryBASiCS(summary_naive, Param = "epsilon")[, "median"])
mu_naive <- displaySummaryBASiCS(summary_naive, Param = "mu")[eps_not_na, ]
delta_naive <- displaySummaryBASiCS(summary_naive, Param = "delta")[eps_not_na, ]

# Over-dispersion versus mean expression
ggplot(
    data.frame(
      mu = mu_naive[, "median"],
      delta = delta_naive[, "median"]
    )
  ) +
  geom_pointdensity(aes(log(mu), log(delta))) +
  scale_color_viridis()

# Compare delta to CV2
ggplot(
    data.frame(
      CV2 = var_genes / mean_genes^2,
      delta = delta_naive[rownames(counts_denoised), "median"]
    )
  ) +
  geom_pointdensity(aes(log(CV2), log(delta))) +
  scale_color_viridis()

The over-dispersion parameters estimated using BASiCS show strong correlation with CV^2. Recently, we extended BASiCS to avoid the mean-variability relationship by performing an internal regression between the over-dispersion and mean expression parameters (as visualised in figure Figure showFit). Similarly, Kolodziejczyk et al. used the distance to a rolling median (DM) along the mean-variability trend to correct for this confounding factor (Kolodziejczyk et al. 2015). Here, we highlight how to obtain the residual variability estimates using BASiCS and scran.

# Residual over-dispersion estimates
epsilon_naive <- displaySummaryBASiCS(summary_naive, Param = "epsilon")[eps_not_na, ]

# Residual over-dispersion versus mean expression
ggplot(
    data.frame(
      mu = mu_naive[, "median"],
      epsilon = epsilon_naive[, "median"]
    )
  ) +
  geom_pointdensity(aes(log(mu), epsilon)) +
  scale_color_viridis()

# DM values
DM.naive <- scran::DM(mean = mean_genes, cv2 = var_genes / mean_genes^2)

# DM versus mean expression
ggplot(
    data.frame(
      mean = mean_genes,
      DM = DM.naive
    )
  ) +
  geom_pointdensity(aes(log(mean), DM)) +
  scale_color_viridis()

# Compare residual over-dispersion and DM
ggplot(
    data.frame(
      epsilon = epsilon_naive[rownames(counts_denoised), "median"],
      DM = DM.naive
    )
  ) +
  geom_pointdensity(aes(epsilon, DM)) +
  scale_color_viridis()

Neither the DM nor the residual over-dispersion estimates show association with mean expression. Furthermore, the mean-independent variability measures display high correlation. These measures can be used to associate genomic features (Morgan and Marioni 2018, @Faure2017) or transcriptional dynamics (Antolović et al. 2017) to gene expression variability. While the DM is calculated as a point estimate, BASiCS stores each posterior sample within the BASiCS_Chain object. They can be accessed using the displayChain function, which displays cell- or gene-specific samples in form of a matrix where each column contains cell- or gene-specific paramters and rows contain the MCMC samples.

displayChainBASiCS(MCMC_naive, Param = "epsilon")[1:10, 1:10]
##       ENSMUSG00000000001 ENSMUSG00000000056 ENSMUSG00000000085
##  [1,]        -0.07536681         0.06507879         0.22216924
##  [2,]        -0.15479335        -0.13690433         0.15270319
##  [3,]        -0.06352698         0.22609939         0.46427119
##  [4,]        -0.01046080        -0.07029727        -0.02828954
##  [5,]        -0.17238463         0.02289810        -0.43330673
##  [6,]         0.20352617         0.18347470         0.17110981
##  [7,]         0.11240828        -0.07869260        -0.25011630
##  [8,]        -0.21104385         0.37517279         0.55930148
##  [9,]        -0.02664068         0.27998016        -0.24280289
## [10,]        -0.09974358         0.41612638         0.56761640
##       ENSMUSG00000000088 ENSMUSG00000000131 ENSMUSG00000000134
##  [1,]         -0.5857537        -0.08527377        -0.01618740
##  [2,]         -0.9111119        -0.05000047         0.05822246
##  [3,]         -0.4193048        -0.07452967        -0.88827422
##  [4,]         -0.7675487         0.00377885        -0.05168619
##  [5,]         -1.4347722         0.19173647        -0.40196125
##  [6,]         -0.8033958         0.04320052        -0.35513456
##  [7,]         -0.6330546        -0.08919323        -0.23512520
##  [8,]         -0.8704263        -0.19716270        -0.34956891
##  [9,]         -0.7086146        -0.05878905        -0.44520679
## [10,]         -0.8728414        -0.09782059        -0.03920229
##       ENSMUSG00000000142 ENSMUSG00000000148 ENSMUSG00000000149
##  [1,]        -0.68935237         0.78639699        -0.53120733
##  [2,]        -0.30929173         0.36928298         0.26681379
##  [3,]         0.02639998         0.11801540         0.48503581
##  [4,]        -0.33804823         0.14937127         0.07146526
##  [5,]         0.10107506         0.35137627        -0.18575075
##  [6,]        -0.08564049         0.08878641        -0.28882838
##  [7,]        -0.05575148         0.29413602         0.28754328
##  [8,]         0.20641407        -0.14764891        -0.08439752
##  [9,]        -0.37768837         0.04268745         0.56363360
## [10,]        -0.25176882         0.53790689         0.49732032
##       ENSMUSG00000000168
##  [1,]        -0.08912147
##  [2,]         0.43198180
##  [3,]        -0.35804769
##  [4,]        -0.09967217
##  [5,]        -0.10950740
##  [6,]        -0.10902991
##  [7,]         0.26217965
##  [8,]        -0.56934729
##  [9,]         0.19241219
## [10,]        -0.11667154

By testing a certain association (for example between CpG island length and variability (Morgan and Marioni 2018)) for each MCMC sample, one can generate a post-hoc posterior distribution of the test statistic.

The workflow so far highlights the use of BASiCS for analysing cells of a single condition.

5.6 Differential testing between naive and activated CD4+ T cells (two group example)

This section highlights the use of BASiCS to perform differential testing (mean and variability) between cells of two condtions. For convenience, we will compare the naive CD4+ T cells, which were analysed in the previous section to activated CD4+ T cells of the same dataset (Martinez-Jimenez et al. 2017). Naive CD4+ T cells were activated for 3 hours using plate-bound CD3e and CD28 antibodies. T cell activation is linked to strong transcriptional shifts and the up-regulation of lineage specific marker genes, such as Tbx21 and Gata1 (Best et al. 2013; Fu et al. 2012). To generate this data, the authors did not add cytokines, which are needed for T cell differentiation (Zhu and Paul 2010), meaning that any heterogeneity in the activated cell population does not arise from cells residing in different lineage-specific differentiation states. Prior to differential testing, and as explained above, we need to generate a SingleCellExperiment object that is compatible for processing using BASiCS.

5.6.1 Creating the BASiCS Data objects

We have performed quality control on the naive and activated CD4+ T cells above when preparing the BASiCS_Data object. Therefore, we can directly select the activated CD4+ T cells from the sce_naive_active object.

ind_active <- sce_naive_active$Stimulus == "Active"
sce_active <- sce_naive_active[, sce_naive_active$Stimulus == "Active"]

Similar to the procedure described above in the single condition example, we will use the newBASiCS_Data function to re-generate the SingleCellExperiment object for the use with BASiCS.

## select the biological genes of the dataset
counts <- counts(sce_active)
## Select the ERCC spike-ins of the dataset
spikes <- assay(altExp(sce_active, "spike-ins"))
spikes_present <- rowSums(spikes) != 0
## Remove spike-ins that are not present from matrix and SCE object
spikes <- spikes[spikes_present, ]
altexp_present <- altExp(sce_active, "spike-ins")[spikes_present, ]
altExp(sce_active, "spike-ins") <- altexp_present
ind_spike <- c(rep(FALSE, nrow(counts)), rep(TRUE, nrow(spikes)))
spike_input <- ERCC_count[rownames(spikes), ]


# Generate the SingleCellExperiment object
data_active <- newBASiCS_Data(
  Counts = rbind(counts, spikes),
  Tech = ind_spike,
  SpikeInfo = spike_input,
  BatchInfo = sce_active$Individuals
)

## Subset to common genes with naive
data_active <- data_active[rownames(data_naive)]

5.6.2 Running the MCMC

We can use this SingleCellExperiment object as an input to BASiCS_MCMC and run the MCMC sampler over 40,000 iterations.

MCMC_active <- BASiCS_MCMC(
  Data = data_active,
  PrintProgress = FALSE,
  N = 40000,
  Thin = 20,
  Burn = 20000,
  Regression = TRUE,
  WithSpikes = TRUE
)

This sampler runs for 98 minutes on a 1.4 GHz Intel Core i5 processor with 4GB RAM and produces a BASiCS_Chain data object. The same sampling run completed in 59 minutes on a 3.4 GHz Intel Core i7 processor with 16GB RAM. For convenience, this MCMC chain can be again obtained online at https://git.ecdf.ed.ac.uk/vallejosgroup/basicsworkflow2020/.

download_file(
  file = "MCMC_active.rds",
  website = "https://git.ecdf.ed.ac.uk/vallejosgroup/",
  folder = "basicsworkflow2020/raw/master/MCMCs/",
  destfile = "rds/MCMC_active.rds"
)
MCMC_active <- readRDS("rds/MCMC_active.rds")

5.6.3 Quality checks

Similar to the BASiCS_Chain quality checks described above, we will again profile the convergence of the chain using a visual inspection of the trace plots, sample histograms and autocorrelation of individual chains. Furthermore, we will use the BASiCS_DiagPlot function to assess the effective sample size of all chains per parameter class.

# Convergence of mean expression parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_active, Param = "mu", Gene = 1)

plot(MCMC_active, Param = "mu", Gene = 1000)

# Effective sample size
BASiCS_DiagPlot(MCMC_active, Param = "mu")

# Histogram of effective sample size for mu, as some genes seem to have low values for mu.
BASiCS_DiagHist(MCMC_active, Param = "mu")

# Convergence of over-dispersion parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_active, Param = "delta", Gene = 100)

plot(MCMC_active, Param = "delta", Gene = 5000)

# Effective sample size
BASiCS_DiagPlot(MCMC_active, Param = "delta")

# Convergence of residual over-dispersion parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_active, Param = "epsilon", Gene = 200)

plot(MCMC_active, Param = "epsilon", Gene = 5000)

# Effective sample size
BASiCS_DiagPlot(MCMC_active, Param = "epsilon")

# Convergence of mRNA capture efficiency parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_active, Param = "s", Cell = 10)

plot(MCMC_active, Param = "s", Cell = 50)

# Effective sample size
BASiCS_DiagPlot(MCMC_active, Param = "s")

To highlight the regression trend, we can use the BASiCS_ShowFit function.

BASiCS_ShowFit(MCMC_active)

The MCMC sampler converged and the regression captured the full range of data points similar to the regression done on naive CD4+ T cells. We can therefore move on to perform differential testing between naive and activated CD4+ T cells.

5.7 Differential testing

BASiCS_DiagHist(MCMC_active, Param = "mu")

BASiCS_DiagHist(MCMC_active, Param = "delta")

BASiCS_DiagPlot(MCMC_active, Param = "mu")

BASiCS_DiagPlot(MCMC_active, Param = "delta")

5.7.1 Differential mean expression

To perform robust differential mean expression testing, BASiCS removes genes with small effective sample size from EFDR calibration and differential expression testing. As explained above, BASiCS automatically excludes genes with an effective sample size less than 100.

The default settings for differential mean expression testing are as follows:

  • EpsilonM: Log2 fold change (LFC) threshold for changes in mean expression. Default value is \(\log_2(1.5)\approx0.41\).
  • MinESS: Minimum effective sample size for genes to be included in differential tests. Default value is 100.
  • EFDR_M: Expected false discovery rate: 10%
  • Plot, PlotOffset: Boolean to control if results are plotted. Default value is TRUE for both.
# Perform differential testing
Test_DE <- BASiCS_TestDE(
  Chain1 = MCMC_naive,
  Chain2 = MCMC_active,
  EpsilonM = log2(1.5),
  GroupLabel1 = "Naive",
  GroupLabel2 = "Active",
  Plot = FALSE,
  PlotOffset = FALSE,
  CheckESS = TRUE,
  MinESS = 100
)

After running the test, we can now visaulize the results in form of a MA-plot (log ratio versus mean average) and volcano plot (posterior probability versus log ratio).

colour_map <- c(
  "Naive+" = "dark blue",
  "Active+" = "dark red",
  "NoDiff" = "black",
  "ExcludedLowESS" = "grey60",
  "ExcludedByUser" = "grey80"
)

BASiCS_PlotDE(Test_DE, Parameters = "Mean", Plots = "MA")

BASiCS_PlotDE(Test_DE, Parameters = "Mean", Plots = "Volcano")

TableMean <- format(Test_DE, Which = "Mean", Filter = FALSE)
TableMean$ResultDiffMean <- factor(
  TableMean$ResultDiffMean,
  levels = names(colour_map)
)
ord <- order(TableMean$ResultDiffMean, decreasing = TRUE)
TableMean <- TableMean[ord, ]
colour_scale <- scale_color_manual(values = colour_map)
# Visualise MA plot
ggplot(TableMean) +
  geom_point(
    aes(log(MeanOverall), MeanLog2FC, colour = ResultDiffMean),
    shape = 16,
    alpha = 0.5
  ) +
  colour_scale

# Visualise MA plot
ggplot(TableMean) +
  geom_point(
    aes(MeanLog2FC, ProbDiffMean, colour = ResultDiffMean),
    shape = 16,
    alpha = 0.5
  ) +
  colour_scale

TODO: we can also plot this using BASiCS_PlotDE

As we can see for the comparison of naive and activated CD4+ T cells, most genes show strong differences in mean expression. In such cases, it can be beneficial to increase the LFC threshold or to decrease the threshold for the EFDR. Here, we therfore set the LFC threshold to \(\log_2(2)=1\) to detect genes with strong changes in mean expression. We also set MinESS to 100. This causes genes with effective sample size less than 100 in either input chain to be excluded from EFDR calibration and differential expression testing.

# Perform differential testing
Test_DE <- BASiCS_TestDE(
  Chain1 = MCMC_naive,
  Chain2 = MCMC_active,
  EpsilonM = log2(2),
  GroupLabel1 = "Naive",
  GroupLabel2 = "Active",
  Plot = FALSE,
  PlotOffset = FALSE,
  CheckESS = TRUE,
  MinESS = 100
)
TableMean <- format(Test_DE, Which = "Mean", Filter = FALSE)
TableDisp <- format(Test_DE, Which = "Disp", Filter = FALSE)
table(TableMean$ResultDiffMean)
## 
##        Active+ ExcludedLowESS         Naive+         NoDiff 
##           1157            407           1043           5820

In this case, this results in 407 genes with low effective sample size being excluded from differential mean expression tests and 35 genes being excluded from tests of differential over-dispersion. To understand the regulatory programmes that underlie T cell activation, a variety of downstream analyses can be peformed using the differentially expressed genes.

While other computational tools exist to perform differential mean expression analysis, we next want to highlight the use of BASiCS for differential variability testing.

5.7.2 Differential over-dispersion

Due to the negative association between over-dispersion and mean expression parameters, only genes that do not show a change in mean expression. To avoid the confounding effect of mean expression, we perform differential testing by setting the LFC threshold on mean expression to EpsilonM = 0, while using the default LFC threshold on changes in over-dispersion: EpsilonD = log2(1.5). Furthermore, it is crucial to exclude lowly expressed genes from this analysis to avoid biases arising from non-informative genes showing only low levels of stochastic expression.

# Select genes that show expression in both conditions
genes_select <- (TableMean$Mean1 > 1 & TableMean$Mean2 > 1)

Test_DE_LFC0 <- BASiCS_TestDE(
  Chain1 = MCMC_naive,
  Chain2 = MCMC_active,
  EpsilonM = 0,
  GroupLabel1 = "Naive",
  GroupLabel2 = "Active",
  Plot = FALSE,
  PlotOffset = FALSE,
  CheckESS = TRUE,
  MinESS = 100,
  GenesSelect = genes_select
)

We first select the genes that remain similarly expressed between both conditions and highlight the differential over-dispersion results in form of MA- and boxplots.

# Select genes with no changes in mean expression
TableMean0 <- format(Test_DE_LFC0, Which = "Mean", Filter = FALSE)
ind_nochange <- TableMean0$ResultDiffMean == "NoDiff"
TableDisp0 <- format(Test_DE_LFC0, Which = "Disp", Filter = FALSE)

BASiCS_PlotDE(Test_DE_LFC0, Parameters = "Disp", Plot = "MA")

# # MA plot
# ggplot(TableDisp0[ind_nochange, ]) +
#   geom_point(
#     aes(
#       log(TableMean0$MeanOverall[ind_nochange]),
#       DispLog2FC,
#       colour = ResultDiffDisp
#     ),
#     shape = 16,
#     alpha = 0.5
#   ) +
#   scale_color_manual(
#     values = c(
#       "Naive+" = "dark blue",
#       "Active+" = "dark red",
#       "NoDiff" = "black"
#     )
#   )

# Boxplot
wilcox.test(TableDisp0$DispLog2FC[ind_nochange])
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  TableDisp0$DispLog2FC[ind_nochange]
## V = 1948850, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 0
boxplot(TableDisp0$DispLog2FC[ind_nochange],
  ylab = "LFC in over-dispersion", outline = FALSE
)
abline(a = 0, b = 0, lwd = 2, col = "dark red")

wilcox.test(TableDisp0$DispLog2FC[ind_nochange])
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  TableDisp0$DispLog2FC[ind_nochange]
## V = 1948850, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 0

With this analysis, we detect increased over-dispersion in naive CD4+ T cells for genes that show similar expression levels between naive and activated CD4+ T cells.

5.7.3 Differential residual over-dispersion

While the analysis in the previous section is well suited to detect global changes in variability (e.g. detecting if one cell population overall displays higher expression variabilty), it does not allow the testing of changes in mean expression and expression variability in parallel. For this, BASiCS compares the residual over-dispersion parameters, which do not scale with mean expression, between the two conditions. Here, we filter on genes that are lowly expressed in both conditions and, as explained above, remove genes for which the MCMC sampler obtained a low effective sample size:

# Remove lowly expressed genes
low_expr <- !(TableMean$Mean1 < 1 & TableMean$Mean2 < 1)

# Genes to exclude
genes_select <- low_expr

We can now perform differential testing as shown above. Again, we use a LFC threshold higher than the default to capture strong changes in mean expression.

# Perform differential testing
Test_DE <- BASiCS_TestDE(
  Chain1 = MCMC_naive,
  Chain2 = MCMC_active,
  EpsilonM = log2(2),
  GroupLabel1 = "Naive",
  GroupLabel2 = "Active",
  CheckESS = TRUE,
  MinESS = 100,
  Plot = FALSE,
  PlotOffset = FALSE,
  GenesSelect = genes_select
)

We can now visualise the changes in residual over-dispersion between naive and activated CD4+ T cells in the form of a MA-plot. In this visualisation, the difference between the posterior medians of the residual over-dispersion parameters \(\epsilon\) is shown on the y-axis. Epsilon values for genes that are not expressed in at least 2 cells per conditions are marked as NA and are therefore not being displayed.

TableResDisp <- format(Test_DE, Which = "ResDisp", Filter = FALSE)
colour_map <- c(
  "Naive+" = "dodgerblue",
  "Active+" = "firebrick",
  "NoDiff" = "black",
  "ExcludedByUser" = "grey",
  "ExcludedLowESS" = "grey80",
  "ExcludedFromTesting" = "grey50"
)
BASiCS_PlotDE(Test_DE, Parameters = "ResDisp", Plot = "MA")

TableResDisp$ResultDiffResDisp <- factor(
  TableResDisp$ResultDiffResDisp,
  levels = names(colour_map)
)
ord <- order(TableResDisp$ResultDiffResDisp, decreasing = TRUE)
TableResDisp <- TableResDisp[ord, ]
ggplot(TableResDisp) +
  geom_point(
    aes(
      x = log(MeanOverall),
      y = ResDispDistance,
      colour = ResultDiffResDisp
    ),
    shape = 16,
    alpha = 0.5
  ) +
  scale_color_manual(values = colour_map)

While one could perform GO analysis (as explained above) on the genesets that change in residual over-dispersion, here, we want to highlight how to analyse changes in mean expression in parallel to changes in variability. For this, we will first combine the results of the differential mean expression and the differential residual over-dispersion test. We will further remove the genes that were excluded from the test and those that are not expressed in at least 2 cells in either condition.

cols <- setdiff(colnames(TableResDisp), c("GeneName", "MeanOverall"))
res_df <- cbind(TableMean, TableResDisp[, cols])
ind_exclude <- res_df$ResultDiffResDisp == "ExcludedByUser" |
  res_df$ResultDiffResDisp == "ExcludedFromTesting"
res_df <- res_df[!ind_exclude, ]

Next, we can visualise the regulation of each individual gene based on its changes in mean expression and expression variability.

BASiCS_PlotDE(Test_DE, Parameters = "ResDisp", Plot = "MA")

res_df$ResultDiffMean[grep("Excluded", res_df$ResultDiffMean)] <- "NoDiff"
res_df$ResultDiffResDisp[grep("Excluded", res_df$ResultDiffResDisp)] <- "NoDiff"
ggplot(res_df) +
  geom_point(
    aes(
      MeanLog2FC,
      ResDispDistance,
      colour = interaction(ResultDiffMean, ResultDiffResDisp, sep = ", ")
    ),
    alpha = 0.8,
    shape = 16
  ) +
  scale_color_brewer(name = "Categories", palette = "Paired") +
  labs(x = bquote(log[2](FC)), y = "Difference in residual over-dispersion")

6 Differential testing using differentiating cells (two group, droplet-based example)

With the development of droplet-based scRNA-seq (Klein et al. 2015; Macosko et al. 2015) lead to a strong increase in the number of cells that can be profiled per experiment. With this, large-scale scRNA-seq datasets have been generated to study development across multiple time-points and capturing multiple tissues (Ibarra-Soria et al. 2018; Kernfeld et al. 2018). Here we describe the computational analysis of changes in mean expression and transcriptional variability when data is sparse and technical spike-in genes are missing. For this, we compare cells of the presomitic mesoderm and somitic mesoderm using droplet-based scRNA-seq data (Ibarra-Soria et al. 2018).

6.1 Obtaining the data

The full dataset is stored under the accession number E-MTAB-6153 on ArrayExpress and can be obtained via:

if (!file.exists("downloads/rawCounts.tsv")) {
  website <- "https://www.ebi.ac.uk/"
  folder <- "arrayexpress/files/E-MTAB-6153/"
  file <- "E-MTAB-6153.processed.2.zip"
  download.file(
    paste0(website, folder, file),
    destfile = "downloads/rawCounts.zip"
  )
  unzip(zipfile = "downloads/rawCounts.zip", exdir = "downloads")
  file.remove("downloads/rawCounts.zip")
}
rawCounts <- read.delim("downloads/rawCounts.tsv", header = TRUE)

Of note: The file is 65 MB in size while the unzipped, raw counts measure 873 MB in size.

The cluster labels of the original publication ca be obtained via:

if (!file.exists("downloads/cellAnnotation.tsv")) {
  website <- "https://www.ebi.ac.uk/"
  folder <- "arrayexpress/files/E-MTAB-6153/"
  file <- "E-MTAB-6153.processed.3.zip"
  download.file(
    paste0(website, folder, file),
    destfile = "cluster_labels.zip"
  )
  unzip(zipfile = "cluster_labels.zip", exdir = "downloads")
}

cluster_labels <- read.table("downloads/cellAnnotation.tsv",
  sep = "\t", header = TRUE, stringsAsFactors = FALSE)

6.2 Select populations of interest

We select the somitic and pre-somitic mesoderm cells to perform differential testing. Prior to running the MCMC, we want to control for outlying cells and heterogeneous substructure in both cell populations.

cluster_labels[["Cell_type"]] <- cluster_labels$cellType
cluster_labels[["Cell_type"]] <- sub(
  "presomiticMesoderm",
  "PSM",
  cluster_labels[["Cell_type"]]
)
cluster_labels[["Cell_type"]] <- sub("somiticMesoderm",
  "SM", 
  cluster_labels[["Cell_type"]]
)

ind_som <- which(cluster_labels[["Cell_type"]] == "PSM" |
  cluster_labels[["Cell_type"]] == "SM")
rawCounts <- rawCounts[, ind_som]
cluster_labels <- cluster_labels[ind_som, ]

6.3 Generating SingleCellExperiment object and quality control

For pre-processing and visualisation purposes, we load the data into a SingleCellExperiment object. The metadata will be stored in the colData slot.

droplet_sce <- SingleCellExperiment(
  assays = list(counts = as(as.matrix(rawCounts), "dgCMatrix"))
)
rm(rawCounts)
colData(droplet_sce) <- DataFrame(
  cluster_labels,
  subCellType = sub("_.*", "", cluster_labels$cell)
)

For further processing steps, we remove lowly expressed genes.

ind_expressed <- Matrix::rowMeans(counts(droplet_sce)) > 0.1
droplet_sce <- droplet_sce[ind_expressed, ]

To visualise possible sub-structure in the data, we normalise both cell populations using the scran package.

droplet_sce <- computeSumFactors(droplet_sce,
  clusters = colData(droplet_sce)$subCellType
)
droplet_sce <- logNormCounts(droplet_sce)

Next, we compute a PCA using the scater package.

droplet_sce <- runPCA(droplet_sce)

We can now visualise the different factors stored in the colData slot.

# Cell types identified by clustering
plotReducedDim(droplet_sce, dimred = "PCA", colour_by = "subCellType") +
  scale_fill_manual(name = "cellType", values = c("coral4", "steelblue", "limegreen"))

The first PC separates the two different cell types while the second PC captures outlying cells. We will remove these outliers and the intermediate cell population from down-stream analysis.

ind_retain <- reducedDims(droplet_sce)$PCA[, 2] > -5 &
  colData(droplet_sce)$subCellType != "presomiticMesoderm.b"
droplet_sce <- droplet_sce[, ind_retain]

We now collected the cells that we want to process using BASiCS. For this, we will generate the BASiCS data objects.

6.4 Generating BASiCS data objects

Since droplet-based scRNA-seq data are generated without including technical spike-in genes, BASiCS uses measurement error models to quantify technical variation through replication (Carroll 1998). Here, it is crucial to provide batch information to the BASiCS model. In the case of the somitic and pre-somoitic mesoderm cells, embryos of two mice have been used to generate the data. Cells isolated from the first embryo were split into two batches and processed independently. To capture cell-type extrinsic, biological variation between the two mice, we pool cells from the two batches of the first animal and only considere cells from mouse 1 and mouse 2 as replicates.

# Presomitic mesoderm
ind_presom <- colData(droplet_sce)[["Cell_type"]] == "PSM"
cur_counts <- droplet_sce[, ind_presom]
cur_batch <- round(colData(cur_counts)$sample, digits = 0)

PSM_Data <- newBASiCS_Data(
  Counts = as.matrix(counts(cur_counts)),
  Tech = rep(FALSE, nrow(droplet_sce)),
  SpikeInfo = NULL,
  BatchInfo = cur_batch
)

# Somitic mesoderm
ind_som <- colData(droplet_sce)[["Cell_type"]] == "SM"
cur_counts <- droplet_sce[, ind_som]
cur_batch <- round(colData(cur_counts)$sample, digits = 0)

SM_Data <- newBASiCS_Data(
  Counts = as.matrix(counts(cur_counts)),
  Tech = rep(FALSE, nrow(droplet_sce)),
  SpikeInfo = NULL,
  BatchInfo = cur_batch
)

6.5 Running the MCMC

We next estimate model parameters by running the MCMC cell-type specifically. Due to the high cell number (1150 for the pre-somitic mesoderm and 739 for the somitic mesoderm), we set the number of iterations to 20000. In this case, we used the regression BASiCS model to additionally estimate residual over-dispersion parameters.

# Presomitic mesoderm cells
PSM_MCMC <- BASiCS_MCMC(
  PSM_Data,
  N = 20000,
  Thin = 10,
  Burn = 10000,
  Regression = TRUE
)

# Somitic mesoderm cells
SM_MCMC <- BASiCS_MCMC(
  SM_Data,
  N = 20000,
  Thin = 10,
  Burn = 10000,
  Regression = TRUE
)

Running these MCMC will take around 8-12 hours on a standard PC (2.6 GHz i5, 8 GB RAM, using 1 core). Here, we provide these chains to download from:

website <- "https://git.ecdf.ed.ac.uk/"
folder <- "vallejosgroup/basicsworkflow2020/raw/master/MCMCs/"
PSM_MCMC <- readRDS(
  file = url(
    paste0(website, folder, "PSM_MCMC.rds")
  )
)
SM_MCMC <- readRDS(
  file = url(
    paste0(website, folder, "SM_MCMC.rds")
  )
)

6.6 Validating the model fit

Next, we visualise the results of the MCMC sampler by visualizing the different chains and by plotting the regression trend. To assess whether the chains converged, we will visualise trace plots for some of the parameters. The plot function allows us to generate trace, posterior density, and autocorrelation plots for different parameters.

plot(PSM_MCMC, Param = "mu", Gene = 120)

plot(PSM_MCMC, Param = "delta", Gene = 10)

plot(PSM_MCMC, Param = "epsilon", Gene = 20)

plot(PSM_MCMC, Param = "s", Cell = 90)

plot(PSM_MCMC, Param = "nu", Cell = 100)

plot(PSM_MCMC, Param = "beta", RegressionTerm = 2)

plot(PSM_MCMC, Param = "sigma2", Column = 1)

plot(SM_MCMC, Param = "mu", Gene = 120)

plot(SM_MCMC, Param = "delta", Gene = 10)

plot(SM_MCMC, Param = "epsilon", Gene = 20)

plot(SM_MCMC, Param = "s", Cell = 90)

plot(SM_MCMC, Param = "nu", Cell = 100)

plot(SM_MCMC, Param = "beta", RegressionTerm = 2)

plot(SM_MCMC, Param = "sigma2")

We observe that the chains for all chosen parameters appear to have converged. Furthermore, to validate that the model fitted the mean-variability trend correctly, we plot posterior estimates for over-dispersion paramters \(\delta_i\) against posterior estimates of mean expression parameters \(\mu_i\). For this, the BASiCS_ShowFit function can be used.

BASiCS_ShowFit(PSM_MCMC)

BASiCS_ShowFit(SM_MCMC)

Both trends display similar behaviour which allows us to compare residual over-dispersion estimates.

6.7 Differential testing

Next, we test for changes in mean expression and expression variability between the somitic and pre-somitic mesoderm. First, we are interested in assessing global changes in expression variability between the two conditions. For this, we compare over-dispersion parameters \(\delta_i\) for genes that are similarly expressed in both conditions.

Droplet_Test_logFC0 <- BASiCS_TestDE(
  Chain1 = PSM_MCMC,
  Chain2 = SM_MCMC,
  EpsilonM = 0,
  GroupLabel1 = "PSM",
  GroupLabel2 = "SM",
  Plot = FALSE,
  PlotOffset = FALSE,
  CheckESS = TRUE,
  MinESS = 100
)
DropletTableDisp <- format(Droplet_Test_logFC0, Which = "Disp", Filter = FALSE)
not_excluded <- DropletTableDisp$ResultDiffDisp != "ExcludedFromTesting"
for_plot <- DropletTableDisp[not_excluded, c("Disp1", "Disp2")]
for_plot <- reshape2::melt(for_plot)
ggplot(for_plot) + geom_boxplot(aes(variable, log(value))) +
  scale_x_discrete(labels = c("PSM", "SM")) +
  ylab("log(delta)") + xlab("")

6.7.1 Global changes in variability

With this analysis, we do not detect global changes in expression variability. We next profile changes in mean expression and expression variability on a gene-specific level. For this, we use a log2 fold change threshold of 1 for mean expression testing and the default threshold of \(\psi_0\approx0.41\) for differential variability testing.

Droplet_Test <- BASiCS_TestDE(
  Chain1 = PSM_MCMC,
  Chain2 = SM_MCMC,
  EpsilonM = 1,
  GroupLabel1 = "PSM",
  GroupLabel2 = "SM",
  Plot = FALSE,
  PlotOffset = FALSE,
  CheckESS = TRUE,
  MinESS = 100
)


colour_map_sm <- c(
  "PSM+" = "dark blue",
  "SM+" = "dark red",
  "NoDiff" = "black",
  "ExcludedLowESS" = "grey60",
  "ExcludedByUser" = "grey80"
)

BASiCS_PlotDE(Droplet_Test, Parameter = "Mean", Plot = "MA")

Droplet_TableMean <- format(Droplet_Test, Which = "Mean", Filter = FALSE)
# Differential expression
ggplot(Droplet_TableMean) +
  geom_point(
    aes(log(MeanOverall), MeanLog2FC, colour = ResultDiffMean),
    shape = 16,
    alpha = 0.5
  ) +
  scale_colour_manual(
    name = "Differential\nexpression",
    values = colour_map_sm
  ) +
  ylab(expression(mu[PSM] / mu[SM])) + xlab(expression(log(mu)))

BASiCS_PlotDE(Droplet_Test, Parameter = "ResDisp", Plot = "MA")

Droplet_TableResDisp <- format(Droplet_Test, Which = "ResDisp", Filter = FALSE)
# Differential variability
ggplot(Droplet_TableResDisp) +
  geom_point(
    aes(log(MeanOverall), ResDispDistance, colour = ResultDiffResDisp),
    shape = 16,
    alpha = 0.5
  ) +
  scale_colour_manual(
    name = "Differential\nvariability",
    values = colour_map_sm
  ) +
  ylab(expression(epsilon[PSM] - epsilon[SM])) + xlab(expression(log(mu)))

6.7.2 Differential mean expression

We can now list the genes that were detected as differentially expressed and differentially variable ordered by their difference in mean expression/variability. We first focus on genes that are differentially expressed between the two cell types.

# Highly expressed in somitic mesoderm
ind_sm <- Droplet_TableMean$ResultDiffMean == "SM+"
SM_mean <- Droplet_TableMean[ind_sm, ]
SM_mean <- SM_mean[order(SM_mean$MeanLog2FC, decreasing = FALSE), ]
SM_mean$Symbol <- genenames[SM_mean$GeneName, 2]

# Highly expressed in pre-somitic mesoderm
ind_psm <- Droplet_TableMean$ResultDiffMean == "PSM+"
PSM_mean <- Droplet_TableMean[ind_psm, ]
PSM_mean <- PSM_mean[order(PSM_mean$MeanLog2FC, decreasing = TRUE), ]
PSM_mean$Symbol <- genenames[PSM_mean$GeneName, 2]

We can next perform GO analysis on up- or down-regulated genes. First, we will perform GO analysis on somitic mesoderm specific genes. To do this, we will use the goseq.

library("goseq")
# Collect significan genes as 1 and all other as 0
SM_genes <- as.integer(Droplet_TableMean$ResultDiffMean == "SM+")
names(SM_genes) <- Droplet_TableMean$GeneName

# Build a Null distribution by correcting the gene length bias
pwf <- nullp(SM_genes, "mm10", "ensGene", bias.data = genelength[names(SM_genes)])

GO_wall <- goseq(pwf, "mm10", "ensGene")
ind_signif <- p.adjust(GO_wall$over_represented_pvalue, method = "fdr") < 0.01
SM_GO <- DataFrame(GO_wall[ind_signif, ])

# Add genenames to the GO categories
all_genes <- vector(length = nrow(SM_GO))
for (j in 1:nrow(SM_GO)) {
  allegs <- get(SM_GO$category[j], org.Mm.eg.db::org.Mm.egGO2ALLEGS)
  genes <- unique(unlist(mget(allegs, org.Mm.eg.db::org.Mm.egENSEMBL)))
  genes <- as.character(intersect(genes, SM_mean$GeneName))
  all_genes[j] <- paste(genes, collapse = ", ")
}
SM_GO$Genes <- all_genes

Now, we perform GO analysis on pre-somitic mesoderm specific genes

# Collect significan genes as 1 and all other as 0
PSM_genes <- as.integer(Droplet_TableMean$ResultDiffMean == "PSM+")
names(PSM_genes) <- Droplet_TableMean$GeneName

# Build a Null distribution by correcting the gene length bias
pwf <- nullp(
  PSM_genes,
  "mm10",
  "ensGene",
  bias.data = genelength[names(PSM_genes)]
)

GO_wall <- goseq(pwf, "mm10", "ensGene")
ind_signif <- p.adjust(GO_wall$over_represented_pvalue, method = "fdr") < 0.01
PSM_GO <- DataFrame(GO_wall[ind_signif, ])

# Add genenames to the GO categories
all_genes <- vector(length = nrow(PSM_GO))
for (j in 1:nrow(PSM_GO)) {
  allegs <- get(PSM_GO$category[j], org.Mm.eg.db::org.Mm.egGO2ALLEGS)
  genes <- unique(unlist(mget(allegs, org.Mm.eg.db::org.Mm.egENSEMBL)))
  genes <- as.character(intersect(genes, PSM_mean$GeneName))
  all_genes[j] <- paste(genes, collapse = ", ")
}
PSM_GO$Genes <- all_genes

To visualise the expression of individual genes, we can use the scater package.

celltype_color <- c(SM = "coral4", PSM = "limegreen")
# Expression of Fgf8 in both conditions
ind_fgf <- genenames$external_gene_name == "Fgf8"
plotExpression(droplet_sce,
  features = genenames[ind_fgf, 1],
  x = "Cell_type", colour_by = "Cell_type"
) + scale_fill_manual(name = "Cell type", values = celltype_color)

plotReducedDim(droplet_sce,
  dimred = "PCA",
  colour_by = genenames[ind_fgf, 1]
)

We can also visualise one GO category in form of heatmap. We use ComplexHeatmap to visualise the expression level for each cell of the genes annotated with GO terms (Gu, Eils, and Schlesner 2016).

library("ComplexHeatmap")
genes <- unlist(strsplit(PSM_GO[1, "Genes"], ", "))
for_heatmap <- logcounts(droplet_sce)[genes, ]
colnames(for_heatmap) <- colData(droplet_sce)$cell

# Order cells by cell type
for_heatmap <- for_heatmap[, order(colnames(for_heatmap))]

# Order rows by log2FC
heatmap_ind <- match(rownames(for_heatmap), PSM_mean$GeneName)
heatmap_order <- order(PSM_mean[heatmap_ind, "MeanLog2FC"], decreasing = TRUE)
for_heatmap <- for_heatmap[heatmap_order, ]
rownames(for_heatmap) <- genenames[rownames(for_heatmap), 2]

celltypes <- sub("_.*", "", colnames(for_heatmap))
celltypes <- sub("presomiticMesoderm.a", "PSM", celltypes)
celltypes <- sub("somiticMesoderm", "SM", celltypes)

h <- Heatmap(as.matrix(for_heatmap),
  cluster_columns = FALSE,
  show_column_names = FALSE,
  cluster_rows = FALSE,
  # col = colorRampPalette(c("#053061", "#1B5484", "white", "#932631", "#67001f"))(100),
  col = viridis(100),
  row_names_gp = gpar(fontsize = 7),
  name = "\n\nNormalised\nexpression",
  top_annotation = columnAnnotation(
    `Cell type` = celltypes,
    col = list(`Cell type` = celltype_color)
  )
)
draw(h, merge_legend = TRUE)

6.7.3 Differential variability testing

Next, we are interested in genes that change in variability between the two cell types. To help interpretion of the result, we will split the genes into four categories. These include:

  • More variable in SM, highly expressed in SM
  • More variable in SM, lowly expressed in SM
  • More variable in PSM, highly expressed in PSM
  • More variable in PSM, lowly expressed in SM
gene_groups <- data.frame(
  Genename = Droplet_TableResDisp$GeneName,
  Symbol = genenames[Droplet_TableResDisp$GeneName, 2],
  MeanLog2FC = Droplet_TableMean$MeanLog2FC,
  ResDispDistance = Droplet_TableResDisp$ResDispDistance,
  Regulation = paste(
    Droplet_TableMean$ResultDiffMean,
    Droplet_TableResDisp$ResultDiffResDisp,
    sep = "_"
  )
)

ind_reg <- gene_groups$Regulation %in% c("SM+_SM+", "SM+_PSM+", "PSM+_PSM+", "PSM+_SM+")
gene_groups <- gene_groups[ind_reg, ]

knitr::kable(head(gene_groups[gene_groups$Regulation == "SM+_SM+", ]))
Genename Symbol MeanLog2FC ResDispDistance Regulation
39 ENSMUSG00000000567 Sox9 -1.729 -2.593 SM+_SM+
51 ENSMUSG00000000753 Serpinf1 -2.493 -2.211 SM+_SM+
256 ENSMUSG00000002944 Cd36 -3.484 -1.624 SM+_SM+
393 ENSMUSG00000004665 Cnn2 -1.657 -1.028 SM+_SM+
484 ENSMUSG00000005836 Gata6 -3.345 -3.975 SM+_SM+
518 ENSMUSG00000006311 Etv2 -1.299 -3.290 SM+_SM+

We can visualise individual genes using scater.

# Expression of Meox2 in both conditions
ind_meox <- genenames$external_gene_name == "Meox2"
plotExpression(droplet_sce,
  features = genenames[ind_meox, 1],
  x = "Cell_type", colour_by = "Cell_type"
) + scale_fill_manual(name = "Cell type", values = celltype_color)

plotReducedDim(droplet_sce,
  dimred = "PCA",
  colour_by = genenames[ind_meox, 1]
)

7 Discussion

This section is required if the paper does not include novel data or analyses. It allows authors to briefly summarize the key points from the article.

8 Session Info

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.0 (2020-04-24)
##  os       Ubuntu 18.04.4 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_GB:en                    
##  collate  en_GB.UTF-8                 
##  ctype    en_GB.UTF-8                 
##  tz       Europe/London               
##  date     2020-05-15                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package              * version    date       lib
##  AnnotationDbi        * 1.50.0     2020-04-27 [1]
##  askpass                1.1        2019-01-13 [1]
##  assertthat             0.2.1      2019-03-21 [1]
##  backports              1.1.7      2020-05-13 [1]
##  BASiCS               * 1.99.1     2020-05-15 [1]
##  beeswarm               0.2.3      2016-04-25 [1]
##  BiasedUrn            * 1.07       2015-12-28 [1]
##  Biobase              * 2.48.0     2020-04-27 [1]
##  BiocFileCache          1.11.6     2020-04-16 [1]
##  BiocGenerics         * 0.34.0     2020-04-27 [1]
##  BiocManager            1.30.10    2019-11-16 [1]
##  BiocNeighbors          1.6.0      2020-04-27 [1]
##  BiocParallel           1.22.0     2020-04-27 [1]
##  BiocSingular           1.4.0      2020-04-27 [1]
##  BiocStyle            * 2.15.8     2020-04-19 [1]
##  biomaRt                2.43.6     2020-04-21 [1]
##  Biostrings             2.55.7     2020-03-24 [1]
##  bit                    1.1-15.2   2020-02-10 [1]
##  bit64                  0.9-7      2017-05-08 [1]
##  bitops                 1.0-6      2013-08-17 [1]
##  blob                   1.2.1      2020-01-20 [1]
##  bookdown               0.18       2020-03-05 [1]
##  callr                  3.4.3      2020-03-28 [1]
##  circlize               0.4.9      2020-04-30 [1]
##  cli                    2.0.2      2020-02-28 [1]
##  clue                   0.3-57     2019-02-25 [1]
##  cluster                2.1.0      2019-06-19 [1]
##  coda                   0.19-3     2019-07-05 [1]
##  colorspace             1.4-1      2019-03-18 [1]
##  ComplexHeatmap       * 2.3.5      2020-04-27 [1]
##  cowplot                1.0.0      2019-07-11 [1]
##  crayon                 1.3.4      2017-09-16 [1]
##  curl                   4.3        2019-12-02 [1]
##  DBI                    1.1.0      2019-12-15 [1]
##  dbplyr                 1.4.3      2020-04-19 [1]
##  DelayedArray         * 0.14.0     2020-04-27 [1]
##  DelayedMatrixStats     1.10.0     2020-04-27 [1]
##  desc                   1.2.0      2018-05-01 [1]
##  devtools               2.3.0      2020-04-10 [1]
##  digest                 0.6.25     2020-02-23 [1]
##  dplyr                  0.8.5      2020-03-07 [1]
##  dqrng                  0.2.1      2019-05-17 [1]
##  edgeR                  3.30.0     2020-04-27 [1]
##  ellipsis               0.3.0      2019-09-20 [1]
##  evaluate               0.14       2019-05-28 [1]
##  fansi                  0.4.1      2020-01-08 [1]
##  farver                 2.0.3      2020-01-16 [1]
##  fastmap                1.0.1      2019-10-08 [1]
##  fs                     1.4.1      2020-04-04 [1]
##  geneLenDataBase      * 1.24.0     2020-05-07 [1]
##  GenomeInfoDb         * 1.24.0     2020-04-27 [1]
##  GenomeInfoDbData       1.2.3      2020-04-24 [1]
##  GenomicAlignments      1.23.2     2020-03-24 [1]
##  GenomicFeatures        1.39.7     2020-03-19 [1]
##  GenomicRanges        * 1.40.0     2020-04-27 [1]
##  GetoptLong             0.1.8      2020-01-08 [1]
##  ggbeeswarm             0.6.0      2017-08-07 [1]
##  ggExtra                0.9        2019-08-27 [1]
##  ggplot2              * 3.3.0.9000 2020-04-29 [1]
##  ggpointdensity       * 0.1.0      2019-08-28 [1]
##  GlobalOptions          0.1.1      2019-09-30 [1]
##  glue                   1.4.1      2020-05-13 [1]
##  GO.db                  3.10.0     2020-04-24 [1]
##  goseq                * 1.40.0     2020-04-27 [1]
##  gridExtra              2.3        2017-09-09 [1]
##  gtable                 0.3.0      2019-03-25 [1]
##  hexbin                 1.28.1     2020-02-03 [1]
##  highr                  0.8        2019-03-20 [1]
##  hms                    0.5.3      2020-01-08 [1]
##  htmltools              0.4.0      2019-10-04 [1]
##  httpuv                 1.5.2      2019-09-11 [1]
##  httr                   1.4.1      2019-08-05 [1]
##  igraph                 1.2.5      2020-03-19 [1]
##  IRanges              * 2.22.1     2020-04-28 [1]
##  irlba                  2.3.3      2019-02-05 [1]
##  KernSmooth             2.23-17    2020-04-26 [1]
##  knitr                * 1.28       2020-02-06 [1]
##  labeling               0.3        2014-08-23 [1]
##  later                  1.0.0      2019-10-04 [1]
##  lattice                0.20-41    2020-04-02 [1]
##  lifecycle              0.2.0      2020-03-06 [1]
##  limma                  3.44.1     2020-04-28 [1]
##  locfit                 1.5-9.4    2020-03-25 [1]
##  magick                 2.3        2020-01-24 [1]
##  magrittr               1.5        2014-11-22 [1]
##  MASS                   7.3-51.6   2020-04-26 [1]
##  Matrix                 1.2-18     2019-11-27 [1]
##  matrixStats          * 0.56.0     2020-03-13 [1]
##  memoise                1.1.0      2017-04-21 [1]
##  mgcv                   1.8-31     2019-11-09 [1]
##  mime                   0.9        2020-02-04 [1]
##  miniUI                 0.1.1.1    2018-05-18 [1]
##  munsell                0.5.0      2018-06-12 [1]
##  nlme                   3.1-147    2020-04-13 [1]
##  openssl                1.4.1      2019-07-18 [1]
##  org.Mm.eg.db         * 3.11.1     2020-05-15 [1]
##  pillar                 1.4.4      2020-05-05 [1]
##  pkgbuild               1.0.8      2020-05-07 [1]
##  pkgconfig              2.0.3      2019-09-22 [1]
##  pkgload                1.0.2      2018-10-29 [1]
##  plyr                   1.8.6      2020-03-03 [1]
##  png                    0.1-7      2013-12-03 [1]
##  prettyunits            1.1.1      2020-01-24 [1]
##  processx               3.4.2      2020-02-09 [1]
##  progress               1.2.2      2019-05-16 [1]
##  promises               1.1.0      2019-10-04 [1]
##  ps                     1.3.3      2020-05-08 [1]
##  purrr                  0.3.4      2020-04-17 [1]
##  R6                     2.4.1      2019-11-12 [1]
##  rappdirs               0.3.1      2016-03-28 [1]
##  RColorBrewer           1.1-2      2014-12-07 [1]
##  Rcpp                   1.0.4.6    2020-04-09 [1]
##  RCurl                  1.98-1.2   2020-04-18 [1]
##  remotes                2.1.1      2020-02-15 [1]
##  reshape2               1.4.4      2020-04-09 [1]
##  rjson                  0.2.20     2018-06-08 [1]
##  rlang                  0.4.6      2020-05-02 [1]
##  rmarkdown              2.1        2020-01-20 [1]
##  rprojroot              1.3-2      2018-01-03 [1]
##  Rsamtools              2.3.7      2020-03-18 [1]
##  RSQLite                2.2.0      2020-01-07 [1]
##  rsvd                   1.0.3      2020-02-17 [1]
##  rtracklayer            1.47.0     2019-10-29 [1]
##  S4Vectors            * 0.26.0     2020-04-27 [1]
##  scales                 1.1.1      2020-05-11 [1]
##  scater               * 1.16.0     2020-04-27 [1]
##  scran                * 1.16.0     2020-04-27 [1]
##  sessioninfo            1.1.1      2018-11-05 [1]
##  shape                  1.4.4      2018-02-07 [1]
##  shiny                  1.4.0.2    2020-03-13 [1]
##  SingleCellExperiment * 1.10.1     2020-04-28 [1]
##  statmod                1.4.34     2020-02-17 [1]
##  stringi                1.4.6      2020-02-17 [1]
##  stringr                1.4.0      2019-02-10 [1]
##  SummarizedExperiment * 1.18.1     2020-04-30 [1]
##  testthat               2.3.2      2020-03-02 [1]
##  tibble                 3.0.1      2020-04-20 [1]
##  tidyselect             1.0.0      2020-01-27 [1]
##  usethis                1.6.1      2020-04-29 [1]
##  vctrs                  0.3.0      2020-05-11 [1]
##  vipor                  0.4.5      2017-03-22 [1]
##  viridis              * 0.5.1      2018-03-29 [1]
##  viridisLite          * 0.3.0      2018-02-01 [1]
##  withr                  2.2.0      2020-04-20 [1]
##  xfun                   0.13       2020-04-13 [1]
##  XML                    3.99-0.3   2020-01-20 [1]
##  xtable                 1.8-4      2019-04-21 [1]
##  XVector                0.28.0     2020-04-27 [1]
##  yaml                   2.2.1      2020-02-01 [1]
##  zlibbioc               1.34.0     2020-04-27 [1]
##  source                                  
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  Bioconductor                            
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  Bioconductor                            
##  Bioconductor                            
##  Bioconductor                            
##  Bioconductor                            
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  Github (jokergoo/ComplexHeatmap@54d6802)
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  Bioconductor                            
##  Bioconductor                            
##  Bioconductor                            
##  Bioconductor                            
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  local                                   
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
##  CRAN (R 4.0.0)                          
##  Bioconductor                            
## 
## [1] /home/alan/R/x86_64-pc-linux-gnu-library/4.0
## [2] /usr/local/lib/R/site-library
## [3] /usr/local/lib/R/library

9 Software version

TODO: Versions of all main Bioconductor packages I think sessionInfo suffices.

10 Data availability

TODO: Links to Grun data and Ximenas data TODO: Links to MCMC chains

11 Software availability

TODO: Software: All software used in this workflow is available as part of Bioconductor X.Y TODO: The source code of this workflow is available from: YYY TODO: Link to Github release version, source code TODO: License: ask Aaron

This section will be generated by the Editorial Office before publication. Authors are asked to provide some initial information to assist the Editorial Office, as detailed below.

  1. URL link to where the software can be downloaded from or used by a non-coder (AUTHOR TO PROVIDE; optional)
  2. URL link to the author’s version control system repository containing the source code (AUTHOR TO PROVIDE; required)
  3. Link to source code as at time of publication (F1000Research TO GENERATE)
  4. Link to archived source code as at time of publication (F1000Research TO GENERATE)
  5. Software license (AUTHOR TO PROVIDE; required)

12 Author information

In order to give appropriate credit to each author of an article, the individual contributions of each author to the manuscript should be detailed in this section. We recommend using author initials and then stating briefly how they contributed.

13 Competing interests

‘No competing interests were disclosed’.

14 Grant information

Please state who funded the work discussed in this article, whether it is your employer, a grant funder etc. Please do not list funding that you have that is not relevant to this specific piece of research. For each funder, please state the funder’s name, the grant number where applicable, and the individual to whom the grant was assigned. If your work was not funded by any grants, please include the line: ‘The author(s) declared that no grants were involved in supporting this work.’

15 Acknowledgments

This section should acknowledge anyone who contributed to the research or the article but who does not qualify as an author based on the criteria provided earlier (e.g. someone or an organization that provided writing assistance). Please state how they contributed; authors should obtain permission to acknowledge from all those mentioned in the Acknowledgments section.

Please do not list grant funding in this section.

Amezquita, Robert A., Vince J. Carey, Lindsay N. Carpp, Ludwig Geistlinger, Aaron T. L. Lun, Federico Marini, Kevin Rue-Albrecht, et al. 2019. “Orchestrating Single-Cell Analysis with Bioconductor.” Preprint. Genomics. https://doi.org/10.1101/590562.

Antolović, Vlatka, Agnes Miermont, Adam M. Corrigan, and Jonathan R. Chubb. 2017. “Generation of Single-Cell Transcript Variability by Repression.” Current Biology 27: 1811–7. https://doi.org/10.1016/j.cub.2017.05.028.

Arriaga, Edgar A. 2009. “Determining Biological Noise via Single Cell Analysis.” Analytical and Bioanalytical Chemistry 393 (1): 73–80. https://doi.org/10.1007/s00216-008-2431-z.

Best, J Adam, Jamie Knell, Edward Yang, Viveka Mayya, Andrew Doedens, Michael L Dustin, and Ananda W Goldrath. 2013. “Transcriptional Insights into the CD8+ T Cell Response to Infection and Memory T Cell Formation.” Nature Immunology 14 (4): 404–12. https://doi.org/10.1038/ni.2536.

Boettiger, Carl. 2015. “An Introduction to Docker for Reproducible Research.” ACM SIGOPS Operating Systems Review 49 (1): 71–79. https://doi.org/10.1145/2723872.2723882.

Brennecke, Philip, Simon Anders, Jong Kyoung Kim, Aleksandra A Kołodziejczyk, Xiuwei Zhang, Valentina Proserpio, Bianka Baying, et al. 2013. “Accounting for technical noise in single-cell RNA-seq experiments.” https://doi.org/10.1038/nmeth.2645.

Brooks, Stephen P., and Andrew Gelman. 1998. “General Methods for Monitoring Convergence of Iterative Simulations.” Journal of Computational and Graphical Statistics 7 (4): 434–55. https://doi.org/10.1080/10618600.1998.10474787.

Buettner, Florian, Kedar N Natarajan, F Paolo Casale, Valentina Proserpio, Antonio Scialdone, Fabian J Theis, Sarah A Teichmann, John C Marioni, and Oliver Stegle. 2015. “Computational Analysis of Cell-to-Cell Heterogeneity in Single-Cell RNA-Sequencing Data Reveals Hidden Subpopulations of Cells.” Nature Biotechnology 33 (2): 155–60. https://doi.org/10.1038/nbt.3102.

Carroll, Raymond J. 1998. Measurement Error in Epidemiologic Studies. Chichester, UK: John Wiley & Sons, Ltd. https://doi.org/10.1002/9781118445112.stat05178.

Cowles, Mary Kathryn, and Bradley P. Carlin. 1996. “Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review.” Journal of the American Statistical Association 91 (434): 883–904. https://doi.org/10.1080/01621459.1996.10476956.

Durinck, Steffen, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma, and Wolfgang Huber. 2005. “BioMart and Bioconductor: A powerful link between biological databases and microarray data analysis.” Bioinformatics 21 (16): 3439–40. https://doi.org/10.1093/bioinformatics/bti525.

Durinck, Steffen, Paul T Spellman, Ewan Birney, and Wolfgang Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the R/Bioconductor Package biomaRt.” Nature Protocols 4 (8): 1184–91. https://doi.org/10.1038/nprot.2009.97.

Eberwine, James, and Junhyong Kim. 2015. “Cellular Deconstruction: Finding Meaning in Individual Cell Variation.” Trends in Cell Biology 25 (10): 569–78. https://doi.org/10.1016/j.tcb.2015.07.004.

Eling, Nils, Michael D. Morgan, and John C. Marioni. 2019. “Challenges in Measuring and Understanding Biological Noise.” Nature Reviews Genetics 20 (9): 536–48. https://doi.org/10.1038/s41576-019-0130-6.

Eling, Nils, Arianne C. Richard, Sylvia Richardson, John C. Marioni, and Catalina A. Vallejos. 2017. “Robust expression variability testing reveals heterogeneous T cell responses.” bioRxiv, 237214. https://doi.org/10.1101/237214.

———. 2018. “Correcting the Mean-Variance Dependency for Differential Variability Testing Using Single-Cell RNA Sequencing Data.” Cell Systems 7 (3): 1–11. https://doi.org/10.1016/j.cels.2018.06.011.

Elowitz, Michael B, Arnold J Levine, Eric D Siggia, and Peter S Swain. 2002. “Stochastic gene expression in a single cell.” Science 297 (5584): 1183–6. https://doi.org/10.1126/science.1070919.

External RNA Controls Consortium. 2005. “Proposed methods for testing and selecting the ERCC external RNA controls.” BMC Genomics 6 (June 2004): 150. https://doi.org/10.1186/1471-2164-6-150.

Faure, Andre J., Jörn M. Schmiedel, and Ben Lehner. 2017. “Systematic Analysis of the Determinants of Gene Expression Noise in Embryonic Stem Cells.” Cell Systems 5 (5): 471–84. https://doi.org/10.1016/j.cels.2017.10.003.

Finak, Greg, Andrew McDavid, Masanao Yajima, Jingyuan Deng, Vivian Gersuk, Alex K. Shalek, Chloe K. Slichter, et al. 2015. “MAST: A Flexible Statistical Framework for Assessing Transcriptional Changes and Characterizing Heterogeneity in Single-Cell RNA Sequencing Data.” Genome Biology 16 (1): 278. https://doi.org/10.1186/s13059-015-0844-5.

Fu, Wenxian, Ayla Ergun, Ting Lu, Jonathan A Hill, Sokol Haxhinasto, Marlys S Fassett, Roi Gazit, et al. 2012. “A Multiply Redundant Genetic Switch ’Locks in’ the Transcriptional Signature of Regulatory T Cells.” Nature Immunology 13 (10): 972–80. https://doi.org/10.1038/ni.2420.

Gelman, Andrew, John Carlin, Hal Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. 2014. Bayesian Data Analysis. CRC Press.

Grün, Dominic, Lennart Kester, and Alexander van Oudenaarden. 2014. “Validation of noise models for single-cell transcriptomics.” Nature Methods 11 (6): 637–40. https://doi.org/10.1038/nmeth.2930.

Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics.

Haque, Ashraful, Jessica Engel, Sarah A. Teichmann, and Tapio Lönnberg. 2017. “A Practical Guide to Single-Cell RNA-Sequencing for Biomedical Research and Clinical Applications.” Genome Medicine 9 (1): 75. https://doi.org/10.1186/s13073-017-0467-4.

Ibarra-Soria, Ximena, Wajid Jawaid, Blanca Pijuan-Sala, Vasileios Ladopoulos, Antonio Scialdone, David J. Jörg, Richard C. V. Tyser, et al. 2018. “Defining murine organogenesis at single-cell resolution reveals a role for the leukotriene pathway in regulating blood progenitor formation.” Nature Cell Biology 20 (2): 127–34. https://doi.org/10.1038/s41556-017-0013-z.

Ilicic, Tomislav, Jong Kyoung Kim, Aleksandra A Kolodziejczyk, Frederik Otzen Bagger, Davis James Mccarthy, John C Marioni, and Sarah A Teichmann. 2016. “Classification of low quality cells from single-cell RNA-seq data.” Genome Biology 17 (29): 1–15. https://doi.org/10.1186/s13059-016-0888-1.

Islam, Saiful, Amit Zeisel, Simon Joost, Gioele La Manno, Pawel Zajac, Maria Kasper, Peter Lönnerberg, and Sten Linnarsson. 2014. “Quantitative Single-Cell RNA-Seq with Unique Molecular Identifiers.” Nature Methods 11 (2): 163–66. https://doi.org/10.1038/nmeth.2772.

Iwamoto, Kazunari, Yuki Shindo, and Koichi Takahashi. 2016. “Modeling Cellular Noise Underlying Heterogeneous Cell Responses in the Epidermal Growth Factor Signaling Pathway.” PLoS Computational Biology 12 (11): 1–18. https://doi.org/10.1371/journal.pcbi.1005222.

Kernfeld, Eric M., Ryan M. J. Genga, Kashfia Neherin, Margaret E. Magaletta, Ping Xu, and René Maehr. 2018. “A Single-Cell Transcriptomic Atlas of Thymus Organogenesis Resolves Cell Types and Developmental Maturation.” Immunity 48: 1–13. https://doi.org/10.1016/j.immuni.2018.04.015.

Kharchenko, Peter V, Lev Silberstein, and David T Scadden. 2014. “Bayesian approach to single-cell differential expression analysis.” Nature Methods 11 (7): 740–2. https://doi.org/10.1038/nmeth.2967.

Kim, Beomseok, Eunmin Lee, and Jong Kyoung Kim. 2019. “Analysis of Technical and Biological Variabilityin Single-Cell RNA Sequencing.” In Computational Methods for Single-Cell Data Analysis, 1935:25–43. https://doi.org/10.1007/978-1-4939-9057-3.

Kiselev, Vladimir Yu, Tallulah S. Andrews, and Martin Hemberg. 2019. “Challenges in unsupervised clustering of single-cell RNA-seq data.” Nature Reviews Genetics 2018. https://doi.org/10.1038/s41576-018-0088-9.

Kiviet, Daniel J., Philippe Nghe, Noreen Walker, Sarah Boulineau, Vanda Sunderlikova, and Sander J. Tans. 2014. “Stochasticity of metabolism and growth at the single-cell level.” Nature 514 (7522): 376–79. https://doi.org/10.1038/nature13582.

Klein, Allon M., Linas Mazutis, Ilke Akartuna, Naren Tallapragada, Adrian Veres, Victor Li, Leonid Peshkin, David A. Weitz, and Marc W. Kirschner. 2015. “Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells.” Cell 161 (5): 1187–1201. https://doi.org/10.1016/j.cell.2015.04.044.

Kolodziejczyk, Aleksandra A., Jong Kyoung Kim, Jason C. H. Tsang, Tomislav Ilicic, Johan Henriksson, Kedar N. Natarajan, Alex C. Tuck, et al. 2015. “Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation.” Cell Stem Cell 17: 471–85. https://doi.org/10.1016/j.stem.2015.09.011.

Lähnemann, David, Johannes Köster, Ewa Szczurek, Davis J. McCarthy, Stephanie C. Hicks, Mark D. Robinson, Catalina A. Vallejos, et al. 2020. “Eleven Grand Challenges in Single-Cell Data Science.” Genome Biology 21 (1): 31. https://doi.org/10.1186/s13059-020-1926-6.

Lin, Yingxin, Shila Ghazanfar, Dario Strbenac, Andy Wang, Ellis Patrick, David M Lin, Terence Speed, Jean Y H Yang, and Pengyi Yang. 2019. “Evaluating Stably Expressed Genes in Single Cells.” GigaScience 8 (9): giz106. https://doi.org/10.1093/gigascience/giz106.

Lopez, Romain, Jeffrey Regier, Michael B. Cole, Michael I. Jordan, and Nir Yosef. 2018. “Deep Generative Modeling for Single-Cell Transcriptomics.” Nature Methods 15 (12): 1053–8. https://doi.org/10.1038/s41592-018-0229-2.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.

Lun, Aaron T. L., Karsten Bach, and John C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biology 17 (1): 75. https://doi.org/10.1186/s13059-016-0947-7.

Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A step-by-step workflow for basic analyses of single-cell RNA-seq data.” F1000Research 5 (2122). https://doi.org/10.12688/f1000research.9501.1.

Macosko, Evan Z., Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, et al. 2015. “Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets.” Cell 161 (5): 1202–14. https://doi.org/10.1016/j.cell.2015.05.002.

Martinez-Jimenez, Celia P., Nils Eling, Hung-Chang Chen, Catalina A Vallejos, Aleksandra A Kolodziejczyk, Frances Connor, Lovorka Stojic, et al. 2017. “Aging increases cell-to-cell transcriptional variability upon immune stimulation.” Science 1436: 1433–6. https://doi.org/10.1126/science.aah4115.

McCarthy, Davis J., Kieran R. Campbell, Aaron T. L. Lun, and Quin F. Wills. 2017. “Scater: Pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8): 1179–86. https://doi.org/10.1093/bioinformatics/btw777.

McCarthy, Davis J., Yunshun Chen, and Gordon K. Smyth. 2012. “Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40 (10): 4288–97. https://doi.org/10.1093/nar/gks042.

Mojtahedi, Mitra, Alexander Skupin, Joseph Zhou, Ivan G. Castaño, Rebecca Y. Y. Leong-Quong, Hannah Chang, Kalliopi Trachana, Alessandro Giuliani, and Sui Huang. 2016. “Cell Fate Decision as High-Dimensional Critical State Transition.” PLOS Biology 14 (12): e2000640. https://doi.org/10.1371/journal.pbio.2000640.

Morgan, Michael D., and John C. Marioni. 2018. “CpG island composition differences are a source of gene expression noise indicative of promoter responsiveness.” Genome Biology 19 (1): 1–13. https://doi.org/10.1186/s13059-018-1461-x.

Patange, Simona, Michelle Girvan, and Daniel R. Larson. 2018. “Single-cell systems biology: Probing the basic unit of information flow.” Current Opinion in Systems Biology 8: 7–15. https://doi.org/10.1016/j.coisb.2017.11.011.

Plummer, Martyn, Nicky Best, Kate Cowles, and Karen Vines. 2006. “CODA: Convergence Diagnosis and Output Analysis for Mcmc.” R News 6 (1): 7–11. http://oro.open.ac.uk/22547/.

Prakadan, Sanjay M., Alex K. Shalek, and David A. Weitz. 2017. “Scaling by shrinking: empowering single-cell ’omics’ with microfluidic devices.” Nature Reviews Genetics 18 (6): 345–61. https://doi.org/10.1038/nrg.2017.15.

Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47–e47. https://doi.org/10.1093/nar/gkv007.

Roberts, Gareth O., and Jeffrey S. Rosenthal. 2009. “Examples of Adaptive MCMC.” Journal of Computational and Graphical Statistics 18 (2): 349–67. https://doi.org/10.1198/jcgs.2009.06134.

Saelens, Wouter, Robrecht Cannoodt, Helena Todorov, and Yvan Saeys. 2019. “A Comparison of Single-Cell Trajectory Inference Methods.” Nature Biotechnology 37 (5): 547–54. https://doi.org/10.1038/s41587-019-0071-9.

Shalek, Alex K, Rahul Satija, Joe Shuga, John J Trombetta, Dave Gennert, Diana Lu, Peilin Chen, et al. 2014. “Single-cell RNA-seq reveals dynamic paracrine control of cellular variation.” Nature 510 (7505): 263–69. https://doi.org/10.1038/nature13437.

Soneson, Charlotte, and Mark D Robinson. 2018. “Bias, Robustness and Scalability in Single-Cell Differential Expression Analysis.” Nature Methods 15 (4): 255–61. https://doi.org/10.1038/nmeth.4612.

Stegle, Oliver, Sarah a. Teichmann, and John C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics.” Nature Reviews Genetics 16 (3): 133–45. https://doi.org/10.1038/nrg3833.

Townes, F. William, Stephanie C. Hicks, Martin J. Aryee, and Rafael A. Irizarry. 2019. “Feature Selection and Dimension Reduction for Single-Cell RNA-Seq Based on a Multinomial Model.” Genome Biology 20 (1): 295. https://doi.org/10.1186/s13059-019-1861-6.

Vallejos, Catalina A., John C. Marioni, and Sylvia Richardson. 2015a. “BASiCS: Bayesian analysis of single-cell sequencing data.” PLOS Computational Biology 11: e1004333. https://doi.org/10.1371/journal.pcbi.1004333.

———. 2015b. “BASiCS: Bayesian analysis of single-cell sequencing data.” PLOS Computational Biology 11 (6): e1004333. https://doi.org/10.1371/journal.pcbi.1004333.

Vallejos, Catalina A, Sylvia Richardson, and John C Marioni. 2016. “Beyond comparisons of means: understanding changes in gene expression at the single-cell level.” Genome Biology 17 (70). https://doi.org/10.1101/035949.

Vallejos, Catalina A, Davide Risso, Antonio Scialdone, Sandrine Dudoit, and John C Marioni. 2017. “Normalizing single-cell RNA sequencing data: challenges and opportunities.” Nature Methods 14 (6): 565–71. https://doi.org/10.1038/nmeth.4292.

Wills, Quin F, Kenneth J Livak, Alex J Tipping, Tariq Enver, Andrew J Goldson, Darren W Sexton, and Chris Holmes. 2013. “Single-Cell Gene Expression Analysis Reveals Genetic Associations Masked in Whole-Tissue Experiments.” Nature Biotechnology 31 (8): 748–52. https://doi.org/10.1038/nbt.2642.

Zhu, Jinfang, and William E. Paul. 2010. “Peripheral CD4+ T-Cell Differentiation Regulated by Networks of Cytokines and Transcription Factors: Transcription Factor Network in Th Cells.” Immunological Reviews 238 (1): 247–62. https://doi.org/10.1111/j.1600-065X.2010.00951.x.

Zopf, Cristopher J., Katie Quinn, Joshua Zeidman, and Narendra Maheshri. 2013. “Cell-Cycle Dependence of Transcription Dominates Noise in Gene Expression.” PLoS Computational Biology 9 (7): 1–12. https://doi.org/10.1371/journal.pcbi.1003161.